The Journal of 


Experimental Education 


$7.50 A YEAR


A periodical report of scientific investigations relating to child development, 
curriculum, learning, teaching, supervision, measurements, 


March 1957 


CONTENTS 


PAGE 
Computational Illustrations of the Internal and External Consistency 
Analysis of Examination Responses William J. Moonan 181


The Relative Achievement of the Objectives of Elementary School Science 
in Minnesota Schools Jane Johnston 191 


Controlled Experimentation in the Classroom Julian C. Stanley 195 


Recent Techniques for Analyzing Association in Contingency Tables as 
Applied to an Analytical Follow-Up Survey of Education Graduates
Samuel T. Mayo 203 


The Best Linear Estimate of the Predicted Value and the Standard Error 
of the Estimate Palmer O. Johnson 233 


The “Equating” of Non-Parallel Tests William H. Angoff 241


The Comparability of the Bi-Factor and Second-Order Factor Patterns 
John Schmid 249 


An Analysis of Variance of Multiple Measurements on Subjects Classified 
in Unequal Groups of One Dimension
Raymond O. Collier and Clayton L. Stunkard 255


PUBLISHED QUARTERLY 


Published by Dembar Publications, Inc., 
Madison 3, Wisconsin. 
Entered as second-class matter October 17, 1938 at the post office at Madison,
Wisconsin, under the act of March 3, 1879.


$1.75 A COPY 





EDITORIAL BOARD 


A. S. Barr, Chairman, Professor of Education, University of Wisconsin, Madison 6, Wis.

H. H. ..., Professor of ... Psychology, ... Educational Reference, Purdue University, ... Indiana.


CONTRIBUTING EDITORS

..., Director, Betts Reading Clinic, Haverford, Pennsylvania.

..., Professor of Educational Psychology, ...

..., University of California, Berkeley, California.

..., State College of Washington, Pullman, Washington.

..., College Entrance ..., New York, New York.

... A. Davis, Professor of Education, George Peabody College for Teachers, Nashville, Tennessee.

..., Professor of Psychology, University of ...

... Newland, Professor of Education, University of Illinois, Urbana, Illinois.

Willard C. Olson, Dean, School of Education, University of Michigan, Ann Arbor, Michigan.

S. L. Pressey, Educational Psychology, ..., Columbus, Ohio.

William Reitz, Associate Professor ...

..., Educational Research, The University of ...

..., School of Education, Cambridge, Massachusetts.

John Schmid, Research and Evaluation ..., Lackland Air Force Base, San Antonio, Texas.

Louis G. ..., Assistant Professor of Counseling Psychology, School of Education, Indiana University, Bloomington, Indiana.

..., The Psychological Corporation, New York 1, New York.

Herbert A. ..., Professor of Psychology, Ohio State University.

... E. Troyer, ...

..., University of Nebraska.





Journal of Experimental Education 


Volume XXV 


March, 1957 


COMPUTATIONAL ILLUSTRATIONS OF THE 
INTERNAL AND EXTERNAL CONSISTENCY 
ANALYSIS OF EXAMINATION RESPONSES 


WILLIAM J. MOONAN* 
USN Personnel Research Field Activity 
San Diego, California 


Introduction 


THE FIRST purpose of this paper is to illus- 
trate the computations involved in estimating the 
internal consistency of the responses to items 
of an examination which was given under several 
different experimental conditions with each con- 
dition utilizing different subjects. The second 
purpose is to show how to estimate the internal 
and external consistency of responses to the 
items of an examination which was administered 
to the same subjects at different times. The the- 
oretical developments of these two problems 
were presented in references 5 and 6. 

The coefficient of internal consistency measures the intra-class correlation of the responses to the items of an examination. If the responses possess very little internal consistency, then few individual differences can be detected among the subjects to whom the examination is given. Incidentally, the coefficient of internal consistency is a statistic which may be used in comparing two or more examinations without regard to the number of items of each. A coefficient of external consistency is also an intra-class correlation, but it measures the consistency of the responses to the items, or of the scores, made by the same subjects in different administrations. An examination which is associated with a small coefficient of external consistency would be of limited value for identifying individual differences, since the scores of an individual would vary nearly as much over the repeated administrations as would the scores of all individuals in any administration. In effect, then, a score is not stable relative to the time or conditions of the administration of the examination and to the scores made by other individuals. Computational illustrations of the determination of coefficients of internal and external consistency are given in the following two sections.


Simultaneous Examination and Method Analysis (5)


The data for this illustration were taken from (4), which was concerned with evaluating the effects of a high school Solid Geometry prerequisite for students in college engineering drawing classes. There were three "methods" (M = 3) involved. These are designated by the type of students who were enrolled in one of three groups. The three groups were:

Experimental Group—Students who entered 
the College of Engineering with a deficiency in 
high school Solid Geometry and who enrolled in 
a college course called Engineering Drawing - 4 
prior to taking Solid Geometry in college. 

Regular Group—Students who entered the Col- 
lege of Engineering without a deficiency in high 
school Solid Geometry and who enrolled in Engin- 
eering Drawing - 4. 

Control Group— Students who entered the Col- 
lege of Engineering with a deficiency in high 
school Solid Geometry and who enrolled in Solid 
Geometry in college prior to enrolling in Engin- 
eering Drawing - 4. 


A five-choice, multiple-choice examination consisting of fifty items (I = 50) was given to each of the individuals in each of the groups after they had completed Engineering Drawing - 4. The items of the examination concerned problems typically associated with orthographic projection. The responses to the items were scored dichotomously: correct responses were scored 1; otherwise they were scored 0. Total scores were equivalent to the total number of correct responses. W. G. Cochran (2) has shown that analysis of variance procedures are suitable for use with


*The opinions expressed are solely those of the author and are in no way official; nor are they to be construed as representing those of the U.S. Naval Personnel Research Field Activity or Bureau of Personnel.





Number 3 




data scored in this manner.

The parametric form of a response to an item (i) by a subject (s) in one of the groups (m) is indicated by

$$y(ims) = \xi + \iota(i) + \mu(m) + \sigma(s) + \theta(im) + \phi(is),$$
$$i = 1, \ldots, I; \quad m = 1, \ldots, M; \quad s = 1, \ldots, S(m),$$

where ξ represents the general effect, ι(i) the effect of the ith item, μ(m) the effect of the mth method, σ(s) the effect of the sth subject, θ(im) the effect of the (im)th interaction, and φ(is) the effect of the (is)th interaction. It is further assumed that each response is associated with a variance, σ², and that the responses for a given subject are correlated to the degree ρ(I), which is called the coefficient of internal consistency. If the examination consists of items which are not mutually intra-correlated to the same degree, the estimate of ρ(I) will be a ratio of the average covariance and the average variance. The analysis of variance of the y(ims) also assumes that all parameters involving an s parscript are random variables.

The data are summarized as sums and sums of squares (S.S.) in Part I of Table I. The quantities involved are designated according to the symbols used in Table I of (5). The layout, which is basic to the determination of the numerical values in Part I of Table I, will now be described. Consider, for each of the three groups, a rectangle of responses, which in this case are 1's and 0's, whose width corresponds to the number of items and whose length corresponds to the number of subjects in the group. If, for a given group, we total the 1's for individuals across the rows, we obtain the total scores. If we total the 1's vertically down the columns, we obtain the item totals. For each group we need the sum and the sum of squares of the responses within the rectangle, and the sums of squares of the item totals and of the total scores. For dichotomously scored (0,1) items, the sum of squares of the responses is equal to their sum. Additionally, the totals of the same items of each group are added together and the sum of the squares of these totals (2046044) is found. The calculation formulas given in Part II of Table I are then evaluated for the data given in Part I. The summary of the analysis of variance is given at the top of Table II.
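The layout just described can be sketched in Python with NumPy. This is a minimal illustration, not the author's procedure: `group_summaries` and `anova_simultaneous` are hypothetical names, each group's responses are assumed to be stored as a subjects-by-items array, and the F ratios use E[IS] or E[S] as error terms, as the variance ratios quoted in the text imply. Fed the Table I sums, it should reproduce the entries of Table II up to rounding.

```python
import numpy as np


def group_summaries(y):
    """Table I entries for one group; y is a (subjects x items) response array."""
    y = np.asarray(y, dtype=float)
    return {
        "S": y.shape[0],                          # S(m), subjects in the group
        "sum": y.sum(),                           # sum of responses
        "ss_resp": (y ** 2).sum(),                # equals the sum for 0/1 data
        "ss_items": (y.sum(axis=0) ** 2).sum(),   # S.S. of item totals (column sums)
        "ss_scores": (y.sum(axis=1) ** 2).sum(),  # S.S. of total scores (row sums)
    }


def anova_simultaneous(groups, I, ss_items_pooled):
    """Sums of squares, variances and F ratios for the items x methods design.

    ss_items_pooled is the sum of squared item totals after adding the same
    item's totals over all groups (2046044 in Table I)."""
    M = len(groups)
    S = sum(g["S"] for g in groups)
    grand = sum(g["sum"] for g in groups)
    A = ss_items_pooled / S
    B = sum(g["ss_items"] / g["S"] for g in groups)
    C = sum(g["sum"] ** 2 / (I * g["S"]) for g in groups)
    D = sum(g["ss_scores"] for g in groups) / I
    G = grand ** 2 / (I * S)
    T = sum(g["ss_resp"] for g in groups)
    table = {                                     # source: (df, sum of squares)
        "H[I]": (I - 1, A - G),
        "H[IM]": ((I - 1) * (M - 1), B - A - C + G),
        "E[IS]": ((I - 1) * (S - M), T - B - D + C),
        "H[M]": (M - 1, C - G),
        "E[S]": (S - M, D - C),
    }
    var = {k: ss / df for k, (df, ss) in table.items()}
    F = {
        "H[I]": var["H[I]"] / var["E[IS]"],
        "H[IM]": var["H[IM]"] / var["E[IS]"],
        "H[M]": var["H[M]"] / var["E[S]"],
        "F[0]": var["E[S]"] / var["E[IS]"],       # ratio used for rho(I)
    }
    return table, var, F


table_i = [
    {"S": 51, "sum": 1216, "ss_resp": 1216, "ss_items": 36390, "ss_scores": 31460},
    {"S": 274, "sum": 7426, "ss_resp": 7426, "ss_items": 1297838, "ss_scores": 213534},
    {"S": 23, "sum": 700, "ss_resp": 700, "ss_items": 11268, "ss_scores": 22658},
]
table, var, F = anova_simultaneous(table_i, I=50, ss_items_pooled=2046044)
for source, (df, ss) in table.items():
    print(f"{source:6s} df={df:6d} SS={ss:10.4f} var={var[source]:.4f}")
print({k: round(v, 2) for k, v in F.items()})
```

With raw arrays, `[group_summaries(g) for g in groups]` supplies the same dictionaries; the pooled item term still has to be formed from item totals summed across groups before squaring, which the per-group summaries alone cannot provide.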

The 95% confidence interval (C.I.) of ρ(I), which is a correlational property of the responses to the items, and also that of the index of internal consistency, ρ(H), will now be evaluated. The quantity ρ(H) is a property of the entire examination, since it is the Spearman-Brown function of the number of items, I, and ρ(I); thus ρ(H) = Iρ(I)/[1 + (I - 1)ρ(I)]. A small ρ(H) would indicate that compounding the responses of a subject to the I items of the examination does not greatly facilitate the detection of individual differences. The information given by the estimate of ρ(H) is of the same type as that provided by "split-half" and other similar types of reliability coefficients if the Spearman-Brown formula has been used in conjunction with them. In these cases intra-class correlations should be used instead of product-moment correlations, since the Spearman-Brown formula assumes that the variances of the subjects' scores are equal. The procedure given here has the advantage of being independent of any type of "split" made on the items to form the subtests. In some cases an unfortunate "split" may lead to a misleading index of internal consistency.

To get the C.I.'s, we consult a variance ratio table and find the two tabulated values, F[U] = F[345, 16905; .975] = 1.17 and F[L] = 1/F[16905, 345; .975] = 1/1.20 = .833. From the variance column of Table II the statistic F[0] = .9330/.1822 = 5.12 is determined. Using equation 13 of (5), the interval is calculated to be

$$\frac{5.12 - .833}{5.12 + 49(.833)} > \rho(I) > \frac{5.12 - 1.17}{5.12 + 49(1.17)} \tag{1}$$

$$.09 > \rho(I) > .06,$$

where I - 1 = 49. Since the C.I. does not include zero, the hypothesis that ρ(I) = 0 (or ρ(H) = 0) is rejected at the .05 level of significance. The point estimate of ρ(I) is found by substituting 1.00 for 1.17 on the right side of (1); thereby r(I) = .08. The 95% C.I. for ρ(H) is

$$1 - \frac{.833}{5.12} > \rho(H) > 1 - \frac{1.17}{5.12} \tag{2}$$

$$.84 > \rho(H) > .77$$

The point estimate of ρ(H) is also found by substituting 1.00 for 1.17, but this is done now on the right side of (2). The quantity r(H) is found to be .80.
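The interval arithmetic of (1) and (2) is easy to script. The sketch below is an illustration, not code from (5): `rho_interval`, `index_interval` and `spearman_brown` are hypothetical helper names, and the tabled F limits are passed in by hand rather than looked up. It also shows why the two point estimates agree: when r(I) = (F - 1)/(F + I - 1), the Spearman-Brown value I·r(I)/[1 + (I - 1)r(I)] simplifies algebraically to 1 - 1/F.

```python
def rho_interval(F, F_lower, F_upper, k):
    """Interval for an intra-class correlation, in the form of equation (1).

    F is the observed variance ratio, F_lower and F_upper the tabled limits,
    and k the number of intra-correlated measures (I items here).
    Putting 1 in place of F_upper gives the point estimate."""
    def rho(f):
        return (F - f) / (F + (k - 1) * f)
    return rho(F_lower), rho(1.0), rho(F_upper)


def index_interval(F, F_lower, F_upper):
    """Interval for the Spearman-Brown index, in the form of equation (2)."""
    return 1 - F_lower / F, 1 - 1 / F, 1 - F_upper / F


def spearman_brown(r, k):
    """Index for k measures built from an intra-class correlation r."""
    return k * r / (1 + (k - 1) * r)


upper, r_I, lower = rho_interval(5.12, 0.833, 1.17, k=50)
print(f"{upper:.2f} > rho(I) > {lower:.2f}; r(I) = {r_I:.2f}")
h_upper, r_H, h_lower = index_interval(5.12, 0.833, 1.17)
print(f"{h_upper:.2f} > rho(H) > {h_lower:.2f}; r(H) = {r_H:.2f}")
# The point estimates are linked exactly by the Spearman-Brown formula.
print(abs(spearman_brown(r_I, 50) - r_H) < 1e-12)
```

The same two functions apply to the range examination analyzed later: with k = A = 2, F = 1.25 and the limits .676 and 1.48 they give the intervals reported there for ρ(A) and ρ(B).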

TABLE I

SUMS, SUMS OF SQUARES AND ANALYSIS OF VARIANCE CALCULATIONS FOR
THE DATA TAKEN FROM REFERENCE (4)

I. Sums and Sums of Squares

                                                Group
Description           Symbol             Experimental   Regular      Control     Total
Subjects              S(m)               S(1) = 51      S(2) = 274   S(3) = 23   S = 348
Sum of Responses      Σ_iΣ_s y(ims)      1216           7426         700         9342
S.S. of Responses     Σ_iΣ_s y²(ims)     1216           7426         700         9342
S.S. of Item Totals   Σ_i[Σ_s y(ims)]²   36390          1297838      11268       2046044*
S.S. of Total Scores  Σ_s[Σ_i y(ims)]²   31460          213534       22658       267652

*This number corresponds to the numerator of A given below.

II. Analysis of Variance Calculations

A = Σ_i[Σ_mΣ_s y(ims)]²/S = 2046044/348 = 5879.4368
B = Σ_mΣ_i[Σ_s y(ims)]²/S(m) = 36390/51 + 1297838/274 + 11268/23 = 5940.0775
C = Σ_m[Σ_iΣ_s y(ims)]²/IS(m) = (1216)²/(50)(51) + (7426)²/(50)(274) + (700)²/(50)(23) = 5031.1693
D = Σ_mΣ_s[Σ_i y(ims)]²/I = 267652/50 = 5353.0400
G = [Σ_iΣ_mΣ_s y(ims)]²/IS = (9342)²/(50)(348) = 5015.6876
T = Σ_iΣ_mΣ_s y²(ims) = 1216 + 7426 + 700 = 9342.0000

TABLE II

ANALYSIS OF VARIANCE OF THE y(ims) AND SUMMARY STATISTICS BY GROUPS FOR THE DATA
TAKEN FROM REFERENCE (4)

I. Analysis of Variance of the y(ims)

Source of Variation   Degrees of Freedom    Sum of Squares         Variance   F
H[I]; ι(i) = 0        I-1 = 49              A-G = 863.7492         17.6275    96.75**
H[IM]; θ(im) = 0      (I-1)(M-1) = 98       B-A-C+G = 45.1590      .4608      2.53**
E[IS]                 (I-1)(S-M) = 16905    T-B-D+C = 3080.0518    .1822
Intra-Subject Total   (I-1)S = 17052        T-D = 3988.9600
H[0]; ξ = 0           1                     G = 5015.6876
H[M]; μ(m) = 0        M-1 = 2               C-G = 15.4817          7.7409     8.30**
E[S]                  S-M = 345             D-C = 321.8707         .9330
Inter-Subject Total   S = 348               D = 5353.0400
Grand Total           IS = 17400            T = 9342.0000

** Hypothesis is rejected at least at the .01 level of significance.

II. Summary Statistics by Groups

Statistic            Experimental   Regular   Control   Composite
Items                50             50        50        50
Subjects             51             274       23        348
Means                23.84          27.10     30.43     26.84
r(I)
r(H)
S.E.M.
S.D. (Scores)
Mean Item Diff.      .48            .54       .61       .54
S.D. (Item Diff.)

The coefficient of internal consistency, r(I) = .08, for these data is quite low, and an examination at least as long as 50 items of the same kind is needed if the examination is to be considered efficient for identifying individual differences. This is indicated by the moderate value r(H) = .80. The standard error of measurement (S.E.M.), √(Iσ²[1 - ρ(I)]), expressed in terms of total score units, is estimated by multiplying the variance of E[IS] by I = 50 and then taking the square root of that product. Thus the S.E.M. = √[50(.1822)] = 3.02. Also, the standard deviation of the subjects' scores is found by multiplying the variance of E[S] by I = 50 and taking the square root of that product: the S.D. = √[50(.9330)] = 6.83. Separate evaluations of the r(I)'s, r(H)'s, S.E.M.'s and S.D.'s for each group have also been made and are presented at the bottom of Table II. It may be seen that they exhibit remarkable agreement from group to group.
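The S.E.M. and S.D. arithmetic above can be checked with a few lines of Python. In this sketch the function name `sem_and_sd` is hypothetical and the inputs are the rounded E[IS] and E[S] variances from Table II. Under the intra-class model, I·Var E[IS] estimates Iσ²[1 - ρ(I)] and I·Var E[S] estimates Iσ²[1 + (I - 1)ρ(I)], so 1 - S.E.M.²/S.D.² works out to Iρ(I)/[1 + (I - 1)ρ(I)], the index r(H); the last line shows that link.

```python
import math


def sem_and_sd(var_IS, var_S, n_items):
    """Total-score S.E.M. and S.D. from the E[IS] and E[S] variances."""
    sem = math.sqrt(n_items * var_IS)   # estimates sqrt(I * sigma^2 * [1 - rho(I)])
    sd = math.sqrt(n_items * var_S)     # estimates sqrt(I * sigma^2 * [1 + (I-1) rho(I)])
    return sem, sd


sem, sd = sem_and_sd(0.1822, 0.9330, 50)
print(f"S.E.M. = {sem:.2f}, S.D. = {sd:.2f}")
# 1 - SEM^2/SD^2 = 1 - Var E[IS]/Var E[S] = 1 - 1/F[0], the point estimate r(H).
print(f"1 - SEM^2/SD^2 = {1 - sem ** 2 / sd ** 2:.2f}")
```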

It is interesting to note from Table II that there exists some interaction between the items and the groups (F = 2.53**). This means that some items are more difficult for some groups than for others, which complicates a straightforward interpretation of the item and method effects. The identification of the interacting items will not be undertaken here. However, scrutiny of the data leads to a judgment that the interaction effects are small compared to the group differences. There seems to be little doubt that a difference in the difficulties of the items has been detected (F = 96.75**), and that the average difficulty, 26.84/50 = .54, seems quite satisfactory, although the standard deviation of the difficulties, √(17.6275/348) = .23, is quite large. This same standard deviation prevails approximately for all groups, although the average difficulties vary according to the group means (.48, .54, and .61). The difficulty values of the items for each group are quite variable, although the average difficulty is about .50, which is a value often recommended for psychometric examinations.
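The difficulty summaries can be computed directly from the item proportions or, as in the text, from the items mean square: since A - G = S·Σ(p_i - p̄)², the quantity Var H[I]/S is just the sample variance of the item difficulties p_i. The sketch below (hypothetical function name, simulated 0/1 data rather than the study's responses) checks that identity and then repeats the text's arithmetic.

```python
import numpy as np


def difficulty_summary(y):
    """Mean and S.D. of item difficulties for a (subjects x items) 0/1 array.

    Returns the mean difficulty, the S.D. computed directly from the item
    proportions, and the S.D. recovered from the items mean square."""
    y = np.asarray(y, dtype=float)
    S, I = y.shape
    p = y.mean(axis=0)                      # proportion correct per item
    sd_direct = p.std(ddof=1)
    # Items mean square as in the H[I] row: (A - G) / (I - 1)
    ss_items = (y.sum(axis=0) ** 2).sum() / S - y.sum() ** 2 / (I * S)
    ms_items = ss_items / (I - 1)
    return p.mean(), sd_direct, np.sqrt(ms_items / S)


# Simulated responses: 40 subjects, 10 items of increasing ease.
rng = np.random.default_rng(1)
demo = (rng.random((40, 10)) < np.linspace(0.2, 0.9, 10)).astype(int)
mean_p, sd_direct, sd_anova = difficulty_summary(demo)
print(f"mean = {mean_p:.3f}, S.D. direct = {sd_direct:.4f}, via mean square = {sd_anova:.4f}")

# The text's figures: mean score 26.84 on 50 items; Var H[I] = 17.6275 with S = 348.
print(round(26.84 / 50, 2), round(float(np.sqrt(17.6275 / 348)), 2))
```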

The hypothesis of equality of the group effects, H[M]; μ(m) = 0, was rejected (F = 8.30**). The group means are given at the bottom of Table II. Duncan's multiple range test (α = .05) shows that each of the group means differs significantly from every other group mean. According to the author of (4), the approximate three-point and six-point differentials which exist between the group means represent a practical as well as a statistical difference. The evidence thus indicates that having had a course in Solid Geometry is beneficial to those taking Engineering Drawing - 4.


The Determination of External and Internal Consistency (6)








The calculations which would be involved in the evaluation of the coefficients of external and internal consistency, according to (6), will now be illustrated. For the example, data collected for the purpose of developing and scoring a proficiency test for a Tactical Range Recorder (TRR) operator (1) will be analyzed. The TRR is an instrument used, with other equipment, in making anti-submarine attacks and is part of the SONAR gear on U.S. Navy ships.

The examination analyzed in this example consisted of twenty-five items (I = 25) which were associated with the determination of "Range". Range is a measure of the distance a submarine is located from the SONAR gear and is one of several kinds of information indicated by the TRR. The proficiency examination consisted of recorded signals of submarines operating under sea conditions. These signals were recorded on phono-discs and used to activate twelve TRR trainers which were operated by 104 subjects (S = 104) who were learning to become sonarmen. The task of the student was to evaluate the range for each of the 25 signals (items). For reasons given in (1) the responses were scored dichotomously.* A score of 1 was given to the response if it did not differ by a given amount from the true distance between the submarine and the SONAR gear; otherwise the response was scored 0. The examination was administered twice (A = 2) during the training period of the students. These administrations occurred during the 16th and 21st weeks of training and will be so designated.

The linear model of a response to an item (i) in an administration (a) by one of the subjects (s) may be written as:

$$y(ias) = \mu + \iota(i) + \alpha(a) + \sigma(s) + \phi(ia) + \theta(is) + \omega(as) + \epsilon(ias),$$
$$i = 1, \ldots, I; \quad a = 1, \ldots, A; \quad s = 1, \ldots, S.$$

The parameters are: μ the general effect, ι(i) the effect of the ith item, α(a) the effect of the ath administration, σ(s) the effect of the sth student, and all two- and three-factor interactions defined by their parscripts. For instance, φ(ia) is the effect of the interaction between the ith item and the ath administration. The variance of a response is σ², and it is assumed that responses by the same subject within an administration are intra-correlated to the degree ρ(I), which is referred to as the coefficient of internal consistency. Responses by a subject to the same item in different administrations are intra-correlated to the degree ρ(A), which is referred to as the coefficient of external consistency. In the development of the analysis of variance of the y(ias) it was also assumed that all parameters involving an s parscript are random variables and that responses by the same subject to different items in different administrations were correlated to the degree ρ = ρ(A)ρ(I). It seems reasonable to suppose that the intra-correlation of responses to different items in different administrations should be less than the intra-correlation of different items in the same administration and less than the intra-correlation of the same items in different administrations. Therefore, according to the logic of the problem, ρ should be


*Although both examples in this paper utilized dichotomously scored responses, the analyses made here do 
not depend upon this method of scoring. 





less than either ρ(A) or ρ(I).

Different choices of ρ lead to different forms of the analysis of variance of the y(ias). In general, the choice ρ = ρ(A)ρ(I) seems most appropriate for the evaluation of responses to items in different administrations. However, this need not necessarily be the case for all types of items and conditions of administration. Thus, the analysis for a general value of ρ needs to be considered. The responses, y(ias), have been analyzed in (6) in the form of a 3-way classification analysis of variance. The expected variances, ℰ(Var.), of the E[IAS], E[IS], E[AS], and E[S] sources of variation, for a general value of ρ, are given below:

$$\mathcal{E}(\mathrm{Var.}\,E[IAS]) = \sigma^2\,[1 - \rho(A) - \rho(I) + \rho]$$
$$\mathcal{E}(\mathrm{Var.}\,E[IS]) = \sigma^2\,[1 - \rho(I) + (A-1)\rho(A) - (A-1)\rho]$$
$$\mathcal{E}(\mathrm{Var.}\,E[AS]) = \sigma^2\,[1 - \rho(A) + (I-1)\rho(I) - (I-1)\rho]$$
$$\mathcal{E}(\mathrm{Var.}\,E[S]) = \sigma^2\,[1 + (I-1)\rho(I) + (A-1)\rho(A) + (A-1)(I-1)\rho]$$


The expected mean squares given in (6) agree with those given above if ρ is set equal to ρ(A)ρ(I). In the event that a ρ other than ρ(A)ρ(I) is assumed, a different method than the one shown here or in (6) must be used to estimate ρ(A), ρ(I) and ρ.
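The four expectations can be transcribed into a small Python function to see what the choice ρ = ρ(A)ρ(I) buys. This is an illustrative sketch: the function name and the trial values ρ(I) = .16 and ρ(A) = .11 (borrowed from the estimates obtained later in this example) are assumptions. With the product choice each expectation factors, for example ℰ(Var. E[IS]) = σ²[1 - ρ(I)][1 + (A - 1)ρ(A)], so Var E[S]/Var E[IS] and Var E[AS]/Var E[IAS] both estimate the same quantity, [1 + (I - 1)ρ(I)]/[1 - ρ(I)].

```python
def expected_variances(rho_I, rho_A, rho, I, A, sigma2=1.0):
    """Expected variances of the E[IAS], E[IS], E[AS] and E[S] sources
    for a general rho, transcribed from the formulas above."""
    return {
        "E[IAS]": sigma2 * (1 - rho_A - rho_I + rho),
        "E[IS]": sigma2 * (1 - rho_I + (A - 1) * rho_A - (A - 1) * rho),
        "E[AS]": sigma2 * (1 - rho_A + (I - 1) * rho_I - (I - 1) * rho),
        "E[S]": sigma2 * (1 + (I - 1) * rho_I + (A - 1) * rho_A
                          + (A - 1) * (I - 1) * rho),
    }


# Illustrative values (not population values): rho(I) = .16, rho(A) = .11.
rI, rA, I, A = 0.16, 0.11, 25, 2
ev = expected_variances(rI, rA, rI * rA, I, A)

# With rho = rho(A) * rho(I) every expectation factors into two terms.
factored = {
    "E[IAS]": (1 - rI) * (1 - rA),
    "E[IS]": (1 - rI) * (1 + (A - 1) * rA),
    "E[AS]": (1 - rA) * (1 + (I - 1) * rI),
    "E[S]": (1 + (I - 1) * rI) * (1 + (A - 1) * rA),
}
for source in ev:
    print(f"{source:7s} general = {ev[source]:.6f}  factored = {factored[source]:.6f}")

# Two different variance ratios then estimate the same function of rho(I).
print(ev["E[S]"] / ev["E[IS]"], ev["E[AS]"] / ev["E[IAS]"], (1 + (I - 1) * rI) / (1 - rI))
```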

The choice ρ = ρ(A) is suitable for the analysis of data arising from certain experimental conditions. R. O. Collier (3) develops this model theoretically, as well as many other analysis of variance models which involve correlated observations. A numerical illustration is also given in (3) for which the assumption ρ = ρ(A) is appropriate. However, the illustration is not concerned with the type of psychometric analysis used here.

The sums and sums of squares which are basic to the analysis of variance of the responses are given in Table III. The literal values A, B, etc., of Table III are defined as in Table IV of (6). The layout of the responses is similar to that used in the previous example. The response rectangle is again laid out for each administration, and the sums of squares of the individual responses within the administrations are calculated. Finally, in effect, the rectangles are superimposed on each other and the respective values summed. Then the sum (3385) and the sums of squares of the totaled responses (5839), item totals (485803), and score totals (115915) are obtained. The analysis of variance is presented at the top of Table IV.
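The superimposed-rectangle layout and the calculation formulas of Table III can be sketched as follows. The function names are hypothetical, the responses are assumed to sit in an administrations × subjects × items array, and the sums-of-squares formulas are the ones printed in Table IV; fed the Table III sums, the sketch should reproduce Table IV up to rounding.

```python
import numpy as np


def table_iii_sums(y):
    """Table III entries from an (administrations x subjects x items) array."""
    y = np.asarray(y, dtype=float)
    A, S, I = y.shape
    return {
        "I": I, "A": A, "S": S,
        "admin_sums": y.sum(axis=(1, 2)),                    # sum per administration
        "resp_ss": (y ** 2).sum(),                           # T
        "resp_ss_pooled": (y.sum(axis=0) ** 2).sum(),        # rectangles superimposed
        "item_ss_pooled": (y.sum(axis=(0, 1)) ** 2).sum(),
        "score_ss_pooled": (y.sum(axis=(0, 2)) ** 2).sum(),
        "item_ss_by_admin": (y.sum(axis=1) ** 2).sum(axis=1),
        "score_ss_by_admin": (y.sum(axis=2) ** 2).sum(axis=1),
    }


def anova_items_admins_subjects(d):
    """(df, sum of squares, variance) for each source of Table IV."""
    I, A, S = d["I"], d["A"], d["S"]
    sums = np.asarray(d["admin_sums"], dtype=float)
    qA = d["item_ss_pooled"] / (A * S)
    qB = (sums ** 2).sum() / (I * S)
    qC = d["score_ss_pooled"] / (I * A)
    qD = np.sum(d["item_ss_by_admin"]) / S
    qE = d["resp_ss_pooled"] / A
    qF = np.sum(d["score_ss_by_admin"]) / I
    qG = sums.sum() ** 2 / (I * A * S)
    T = d["resp_ss"]
    rows = {
        "H[IA]": ((I - 1) * (A - 1), qD - qA - qB + qG),
        "E[IAS]": ((I - 1) * (A - 1) * (S - 1),
                   T - qD - qE - qF + qA + qB + qC - qG),
        "H[I]": (I - 1, qA - qG),
        "E[IS]": ((I - 1) * (S - 1), qE - qA - qC + qG),
        "H[A]": (A - 1, qB - qG),
        "E[AS]": ((A - 1) * (S - 1), qF - qB - qC + qG),
        "H[mu]": (1, qG),
        "E[S]": (S - 1, qC - qG),
    }
    return {k: (df, float(ss), float(ss) / df) for k, (df, ss) in rows.items()}


table_iii = {
    "I": 25, "A": 2, "S": 104, "admin_sums": [1651, 1734], "resp_ss": 3385,
    "resp_ss_pooled": 5839, "item_ss_pooled": 485803, "score_ss_pooled": 115915,
    "item_ss_by_admin": [117095, 126406], "score_ss_by_admin": [28467, 31826],
}
for source, (df, ss, var) in anova_items_admins_subjects(table_iii).items():
    print(f"{source:7s} df={df:5d} SS={ss:10.4f} var={var:.4f}")
```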

In order to determine the 95% C.I.'s of ρ(I) and ρ(H), a variance ratio table is used to find F[IU] = F[103, 2472; .975] = 1.31 and F[IL] = 1/F[2472, 103; .975] = 1/1.36 = .735. Using information from the variance column of Table IV, F[I] = 1.1145/.1898 = 5.87, and the interval for ρ(I) is:

$$\frac{5.87 - .735}{5.87 + 24(.735)} > \rho(I) > \frac{5.87 - 1.31}{5.87 + 24(1.31)} \tag{3}$$

$$.22 > \rho(I) > .12,$$

where 24 = I - 1. The confidence interval does not include zero, so the hypothesis that ρ(I) = 0 (or ρ(H) = 0) is rejected at the .05 level of significance. The point estimate of ρ(I) is found by substituting 1.00 for 1.31 on the right side of (3). Thus r(I) = .16. The index of internal consistency, ρ(H), has a 95% C.I. given by

$$1 - \frac{.735}{5.87} > \rho(H) > 1 - \frac{1.31}{5.87} \tag{4}$$

$$.87 > \rho(H) > .78$$

with the value r(H) = .83 when we replace 1.31 by 1.00 in (4).

To obtain a 95% C.I. for ρ(A), the coefficient of external consistency, we determine F[AU] = F[103, 103; .975] = 1.48, F[AL] = 1/F[103, 103; .975] = 1/1.48 = .676, and F[A] = 1.1145/.8941 = 1.25. In this special case where A = 2, F[AL] = 1/F[AU]. With these numbers the interval is calculated as

$$\frac{1.25 - .676}{1.25 + 1(.676)} > \rho(A) > \frac{1.25 - 1.48}{1.25 + 1(1.48)} \tag{5}$$

$$.30 > \rho(A) > -.08$$

where A - 1 = 1. This confidence interval includes zero, so we accept the hypothesis that ρ(A) = 0 at the .05 level of significance. The point estimate of ρ(A) is found by substituting 1.00 for 1.48 on the right side of (5). The quantity r(A) = .11 is quite small. The responses to the twenty-five items evidently produce scores which are relatively inconsistent between the 16th and 21st weeks of training. It was indicated in (6) that, for the special case of A = 2, which is true for this problem, the product-moment correlation of the total scores in each administration would be arithmetically equal to or larger than r(A). This correlation was evaluated to be .11, which, to the accuracy carried, is identical with r(A). Several factors may account for the low coefficient of external consistency. One of them may be the scoring system used; another, particularly emphasized in (1), is the variance ascribed to the TRR equipment which was used in the study. The equipment effects were not estimated, although an attempt was made to calibrate the instruments.
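The remark about the product-moment correlation can be checked numerically. For A = 2, with x₁ and x₂ the two administration totals, Var E[S] and Var E[AS] are proportional to the variances of x₁ + x₂ and x₁ - x₂, so r(A) = (F[A] - 1)/(F[A] + 1) reduces to 2c/(v₁ + v₂), where c is the covariance and v₁, v₂ the variances. Because (v₁ + v₂)/2 ≥ √(v₁v₂), this can never exceed the Pearson correlation c/√(v₁v₂) in absolute value, and the two coincide when the variances are equal. The sketch below uses simulated scores (not the TRR data) and a hypothetical function name.

```python
import numpy as np


def external_consistency_two_admins(x1, x2, n_items):
    """r(A) from F[A] = Var E[S] / Var E[AS] for A = 2 administrations.

    x1, x2: the subjects' total scores on the two administrations."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    t, d = x1 + x2, x1 - x2
    # With A = 2, both sums of squares are deviation sums of t and d divided
    # by I*A = 2I, and both carry S - 1 degrees of freedom, so F[A] is their ratio.
    ss_S = ((t - t.mean()) ** 2).sum() / (2 * n_items)
    ss_AS = ((d - d.mean()) ** 2).sum() / (2 * n_items)
    F_A = ss_S / ss_AS
    return (F_A - 1) / (F_A + 1)


# Simulated totals for 104 subjects (not the TRR data); spreads deliberately unequal.
rng = np.random.default_rng(3)
ability = rng.normal(15, 2, size=104)
x1 = ability + rng.normal(0, 3, size=104)
x2 = 0.8 * ability + 4 + rng.normal(0, 2, size=104)

r_A = external_consistency_two_admins(x1, x2, n_items=25)
c = np.cov(x1, x2, bias=True)[0, 1]
v1, v2 = x1.var(), x2.var()
r_xy = np.corrcoef(x1, x2)[0, 1]
print(f"r(A) = {r_A:.3f}, 2c/(v1 + v2) = {2 * c / (v1 + v2):.3f}, Pearson r = {r_xy:.3f}")
```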

It is also possible to obtain an index of external consistency from the analysis of variance table. This index was not described in (6) but is related to ρ(A) by the following formula:

$$\rho(B) = \frac{A\,\rho(A)}{1 + (A-1)\,\rho(A)} \tag{6}$$

TABLE III

SUMS, SUMS OF SQUARES AND ANALYSIS OF VARIANCE CALCULATIONS
FOR THE DATA TAKEN FROM REFERENCE (1)

I. Sums and Sums of Squares

                                             Administration
Description           Symbol             16th Week   21st Week   Total
Sum of Responses      Σ_iΣ_s y(ias)      1651        1734        3385
S.S. of Responses     Σ_iΣ_s y²(ias)     1651        1734        5839*
S.S. of Item Totals   Σ_i[Σ_s y(ias)]²   117095      126406      485803*
S.S. of Total Scores  Σ_s[Σ_i y(ias)]²   28467       31826       115915*

*These numbers correspond to the numerators of E, A, and C, respectively.

II. Analysis of Variance Calculations

A = Σ_i[Σ_aΣ_s y(ias)]²/AS = 485803/(2)(104) = 2335.5913
B = Σ_a[Σ_iΣ_s y(ias)]²/IS = [(1651)² + (1734)²]/(25)(104) = 2204.8296
C = Σ_s[Σ_iΣ_a y(ias)]²/IA = 115915/(25)(2) = 2318.3000
D = Σ_aΣ_i[Σ_s y(ias)]²/S = [117095 + 126406]/104 = 2341.3558
E = Σ_iΣ_s[Σ_a y(ias)]²/A = 5839/2 = 2919.5000
F = Σ_aΣ_s[Σ_i y(ias)]²/I = [28467 + 31826]/25 = 2411.7200
G = [Σ_iΣ_aΣ_s y(ias)]²/IAS = (3385)²/(25)(2)(104) = 2203.5048
T = Σ_iΣ_aΣ_s y²(ias) = 1651 + 1734 = 3385.0000

TABLE IV

ANALYSIS OF VARIANCE OF THE y(ias) AND SUMMARY STATISTICS BY WEEKS OF TRAINING FOR
THE DATA TAKEN FROM REFERENCE (1)

I. Analysis of Variance of the y(ias)

Source of Variation   Degrees of Freedom          Sum of Squares                     Variance   F
H[IA]; φ(ia) = 0      (I-1)(A-1) = 24             D-A-B+G = 4.4397                   .1850      1.24
E[IAS]                (I-1)(A-1)(S-1) = 2472      T-D-E-F+A+B+C-G = 367.6403         .1487
H[I]; ι(i) = 0        I-1 = 24                    A-G = 132.0865                     5.5036     29.00**
E[IS]                 (I-1)(S-1) = 2472           E-A-C+G = 469.1135                 .1898
H[A]; α(a) = 0        A-1 = 1                     B-G = 1.3248                       1.3248     1.48
E[AS]                 (A-1)(S-1) = 103            F-B-C+G = 92.0952                  .8941
H[μ]; μ = 0           1                           G = 2203.5048
E[S]                  S-1 = 103                   C-G = 114.7952                     1.1145
Total                 IAS = 5200                  T = 3385.0000

** Hypothesis rejected at least at the .01 level of significance.

II. Summary Statistics by Weeks of Training

Statistic            16th Week   21st Week   Composite
Items                25          25          25
Subjects             104         104         208
Mean Score           15.88       16.67       16.27
r(I)
r(H)
r(A)
r(B)
S.E.M.
S.D. (Scores)
Mean (Item Diff.)    .64         .67         .65
S.D. (Item Diff.)    .18         .15         .16


The relationship of ρ(A) to ρ(B) corresponds to the relationship of ρ(I) to ρ(H). The quantity r(B), which estimates ρ(B), would be useful in evaluating scores combined over the administrations for each subject. A large r(B) would indicate that compounding the scores over the administrations is effective for detecting individual differences between the subjects.

The 95% C.I. for ρ(B) is given by the wide interval

$$1 - \frac{.676}{1.25} > \rho(B) > 1 - \frac{1.48}{1.25} \tag{7}$$

$$.46 > \rho(B) > -.18$$

The point estimate of ρ(B), r(B), is obtained from the right side of (7) by replacing 1.48 by 1.00. In this case r(B) = .20, which agrees with the sample estimate form of (6), r(B) = 2(.11)/[1 + 1(.11)] = .22/1.11 = .20. The estimate r(B) is low, and the hypothesis that ρ(B) = 0 is accepted at the .05 level of significance. Thus the composite scores are of very limited value for detecting individual differences.

The analysis of variance table shows that the hypothesis of no administration effects, tested against a one-sided alternative hypothesis, was accepted (F = 1.48). The estimated average gain of the students over the five-week training period was only about 1 test point, 16.67 - 15.88 = .79. This fact indicates that, on the average, learning during the period was meager for the students. Table IV also indicates that there was little item-administration interaction (F = 1.24), and that the items differed considerably in difficulty (F = 29.00**). The average difficulty in the 16th week was 15.88/25 = .64, and in the 21st week it was 16.67/25 = .67. The composite average difficulty was 16.27/25 = .65. These numbers reveal that, on the average, the items were relatively easy for the students, since the average difficulty is greater than .50. Tightening the tolerance used in scoring the responses would remedy this. The standard deviation of the composite difficulties was √(5.5036/208) = .16. For the separate administrations the item difficulty standard deviations were .18 and .15.

The standard error of measurement, √(Iσ²[1 - ρ(I)]), for a total score obtained on a single administration may be found by noting that the expected value of the variance of E[IS] is σ²[1 - ρ(I)][1 + (A - 1)ρ(A)]. Since the estimated value of ρ(A) is .11, A - 1 = 1, and the variance of E[IS] from Table IV is .1898, the quantity .1898/1.11 = .1710 is computed. To obtain this standard error of measurement (S.E.M.), .1710 is multiplied by the number of items, I = 25, and the square root of the resulting number is taken. Thus, √[25(.1710)] = 2.07. The S.E.M.'s as estimated from the data of each of the separate administrations are shown in Part II of Table IV. The composite scores, that is, the total scores obtained by adding the separate totals obtained by each subject on the two administrations, also have a standard error of measurement. To find this number, the variance of E[AS] is multiplied by IA = (25)(2), and the square root is taken. Therefore, the S.E.M. = √[(25)(2)(.8941)] = 6.69. This number is large relative to the standard deviation of the composite scores, √[(25)(2)(1.1145)] = 7.46, and indicates at once that the estimate of ρ(A) should be small: the standard deviation of scores between administrations is almost as large as the standard deviation of the composite scores. The estimate of the standard deviation of the total score of a single administration is found by dividing the variance of E[S] by 1 + (A - 1)r(A) = 1.11, multiplying this quotient by I = 25, and taking the square root. Thus, the S.D. = √[(25)(1.1145)/1.11] = 5.01. The S.D.'s which were estimated separately from the data of each of the administrations are shown at the bottom of Table IV.
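The four standard deviations in this paragraph follow directly from the Table IV variances. The sketch below uses the rounded printed variances and r(A) = .11; all variable names are illustrative. Its last line also shows that, for the composite, 1 - (S.E.M./S.D.)² equals 1 - Var E[AS]/Var E[S] = 1 - 1/F[A], which is the point estimate r(B).

```python
import math

# Rounded variances from Table IV and the point estimate r(A) = .11.
var_IS, var_AS, var_S = 0.1898, 0.8941, 1.1145
I, A, r_A = 25, 2, 0.11

adj = 1 + (A - 1) * r_A                      # 1 + (A - 1) r(A) = 1.11
sem_single = math.sqrt(I * var_IS / adj)     # total score, one administration
sd_single = math.sqrt(I * var_S / adj)
sem_composite = math.sqrt(I * A * var_AS)    # total summed over both administrations
sd_composite = math.sqrt(I * A * var_S)

print(f"single administration: S.E.M. = {sem_single:.2f}, S.D. = {sd_single:.2f}")
print(f"composite: S.E.M. = {sem_composite:.2f}, S.D. = {sd_composite:.2f}")
# For the composite, 1 - (S.E.M./S.D.)^2 = 1 - Var E[AS]/Var E[S] = 1 - 1/F[A].
print(f"1 - (S.E.M./S.D.)^2 = {1 - (sem_composite / sd_composite) ** 2:.2f}")
```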

Estimates of ρ(I), ρ(H), ρ(A), and ρ(B) may also be obtained by using another F[I] and F[A] calculated from the variance column of Table IV. When F[I] = .8941/.1487 = 6.01, r(I) = .17 and r(H) = .83, and these numbers are very like the r(I) = .16 and r(H) = .83 which were previously obtained. Also, F[A] can be evaluated as .1898/.1487 = 1.28, and this gives r(A) = .12 and r(B) = .22, which are close to the other r(A) = .11 and r(B) = .20.
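Both sets of estimates come from the same two formulas, r = (F - 1)/(F + k - 1) with k = I or k = A, and the index 1 - 1/F; only the pair of variances forming F changes. A short sketch follows, with a hypothetical helper name and the rounded Table IV variances as inputs.

```python
def coefficients(var_num, var_den, k):
    """Intra-class estimate (F - 1)/(F + k - 1) and index 1 - 1/F for F = var_num/var_den."""
    F = var_num / var_den
    return (F - 1) / (F + k - 1), 1 - 1 / F


var = {"S": 1.1145, "IS": 0.1898, "AS": 0.8941, "IAS": 0.1487}   # Table IV variances
I, A = 25, 2

# Ratios with E[S] in the numerator, as in (3)-(7) ...
first = coefficients(var["S"], var["IS"], I) + coefficients(var["S"], var["AS"], A)
# ... and the alternative ratios with E[IAS] in the denominator.
second = coefficients(var["AS"], var["IAS"], I) + coefficients(var["IS"], var["IAS"], A)

for label, (rI, rH, rA, rB) in (("E[S]-based", first), ("E[IAS]-based", second)):
    print(f"{label}: r(I)={rI:.2f} r(H)={rH:.2f} r(A)={rA:.2f} r(B)={rB:.2f}")
```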

In light of the evidence presented, it appears that an examination consisting of the 25 range items, administered to SONAR students during their 16th and 21st weeks of training, permits the detection of individual differences to a moderate degree, r(H) = .83, on a single administration. The coefficient of internal consistency, r(I) = .16, although numerically small, actually is quite large in comparison to other examination data which have been analyzed by the method. The consistency of responses to the same items, or equivalently the consistency of the scores between the two administrations, is, however, very low, r(A) = .11, and this value was not statistically significant at the .05 level. The two sets of responses are internally consistent to a favorable degree within sets, but they are unfavorably externally consistent. This condition means that, although individual differences can be detected to a moderate degree for a given administration because of the length of the examination, the nature of these differences is not consistent over the time period of five weeks. Of course, the analysis cannot account for this unusual circumstance, and further research needs to be undertaken to determine possible causes for it. Repeating this analysis for another sample under the same conditions has not been possible as yet. When this can be done, perhaps the scores for the 18th week of training can be collected and the effects of the TRR's used in the study can be controlled.

Totaling the scores for each subject over the two administrations does not facilitate the detection of individual differences, since the scores made on each administration, being almost externally inconsistent, total to a set whose standard deviation (7.46) is not much greater than the standard error of measurement (6.69) which would be used to detect individual differences in the composite scores. This example is unique in that it shows that it is possible to obtain high internal consistency and low external consistency of responses to items of homogeneous material from subjects who take the same examination at two different times under similar conditions. Admittedly, the administrations were quite far apart compared with the usual test-retest situation, but nevertheless the results were quite unusual.


Summary 


The arithmetic operations involved in estimating coefficients of external and internal consistency of the responses to items from an examination have been illustrated. Two kinds of analyses were involved. One concerned the evaluation of the internal consistency of the responses in a single experimental situation wherein three methods (groups) were being compared. This corresponded to an analysis given in (5) and used data taken from (4). The other analysis concerned the evaluation of both internal and external consistency coefficients and indices from responses given by students to the items on an examination which was administered on two different occasions. Reference (6) provided the basis for the analysis and (1) supplied the data.

The analyses of variance which were used in these two examples provide an efficient and convenient means of determining the usual statistics necessary to make a psychometric analysis. The coefficients and indices of external and internal consistency are easily estimated, both as points and as intervals. Also, the standard deviations of the scores, the standard errors of measurement, and certain item statistics are readily calculated. Additionally, tests of hypotheses regarding the parameters of the response model are easily made.


REFERENCES 


1. Batterton, R. L. The Development of a Proficiency Test for the TRR Operator, Report No. 80 (San Diego, California: USN Personnel Research Field Activity, 1955).

2. Cochran, W. G. “The Comparison of Percentages in Matched Samples,” Biometrika, XXXVII (1950), p. 256.

3. Collier, R. O. Experimental Designs in Which the Observations Are Assumed to be Correlated, Unpublished Ph.D. Dissertation, University of Minnesota, Minneapolis, Minnesota, 1956.

4. Clausen, J. A. The Effects of an Entrance Requirement on Selected Engineering Subjects, Unpublished Ph.D. Dissertation, University of Minnesota, Minneapolis, Minnesota, 1955.

5. Moonan, W. J. “Simultaneous Examination and Method Analysis by Variance Algebra,” Journal of Experimental Education, XXIII (1955), p. 154.

6. Moonan, W. J. “An Analysis of Variance Method for Determining the External and Internal Consistency of an Examination,” Journal of Experimental Education, XXIV (1956), p. 239.


























THE RELATIVE ACHIEVEMENT OF THE OBJECTIVES
OF ELEMENTARY SCHOOL SCIENCE
IN MINNESOTA SCHOOLS


JANE JOHNSTON 
Moorhead State Teachers College 
Moorhead, Minnesota 


TO WHAT extent are the objectives of elementary school science being achieved in Minnesota schools? What pupil, teacher, and teaching-situation factors contribute to the achievement of these objectives? What are the implications for the education of elementary school teachers? These problems were considered in a recent analytical survey of elementary school science in Minnesota.¹*

For various parts of the survey the sampling 
unit was the school system (used only in secur- 
ing preliminary data), the fifth-grade teacher, 
or the fifth-grade class (a cluster sample). 

A brief preliminary questionnaire dealing with elementary school science teaching facilities and procedures was sent to 478 superintendents of public graded elementary school systems in Minnesota in 1952.

A proportionate stratified random sampling 
of 87 fifth-grade teachers from the same schools 
was secured in the following manner: 


1. Schools were divided into four strata according to the number of teachers in their elementary schools: over fifty teachers (Group I), twenty-one through fifty teachers (Group II), eleven through twenty teachers (Group III), and one through ten teachers (Group IV).

2. Group I, the smallest group, contained 
eleven schools. It was decided to include two 
schools (18.2 percent) from this group, and the 
same percentage of schools from each of the 
other three groups. 

3. The names of the school systems were 
then listed alphabetically for each stratum, and 
were numbered in that order. Appropriate ran- 
dom number tables were used to select from this 
listing two schools from Group I, seven from 
Group II, twelve from Group III, and sixty-six 
from Group IV. 
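The selection procedure in steps 1 to 3 can be sketched in Python. This is a minimal sketch under stated assumptions: a seeded pseudo-random generator stands in for the printed random number tables, the function names are hypothetical, and only Group I's size (eleven systems) comes from the text; the system names are placeholders.

```python
import random

def proportional_allocation(stratum_sizes, fraction):
    """Units to draw from each stratum: stratum size times the common
    sampling fraction, rounded to a whole number (Python's round() sends
    exact halves to the nearest even integer)."""
    return {name: round(size * fraction) for name, size in stratum_sizes.items()}

def draw_stratified_sample(listings, allocation, seed=None):
    """List each stratum alphabetically, then draw the allocated number of
    units from that listing at random and without replacement."""
    rng = random.Random(seed)
    return {
        name: rng.sample(sorted(units), allocation[name])
        for name, units in listings.items()
    }

# Group I's size (eleven systems) is from the text; the names are placeholders.
group_i = [f"System {k:02d}" for k in range(1, 12)]
fraction = 2 / 11  # two of eleven, about 18.2 percent

allocation = proportional_allocation({"Group I": len(group_i)}, fraction)
sample = draw_stratified_sample({"Group I": group_i}, allocation, seed=1957)
print(allocation)         # {'Group I': 2}
print(sample["Group I"])  # two distinct placeholder names
```

Applying the same fraction to the other three strata yields the seven, twelve, and sixty-six selections reported above only if their sizes are known; the sketch does not assume them.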

A twenty-two-item questionnaire dealing with teaching experience, colleges attended and courses taken, materials and facilities available for elementary school science teaching, methods used in teaching elementary school science, and time and emphasis placed on science in the classroom was sent to the eighty-seven fifth-grade teachers participating in the survey.


*All footnotes will be found at end of article.





The substantial agreement between answers 
to similar questions on the superintendents’ and 
teachers’ questionnaires, and the fact that com- 
plete questionnaire returns were secured from 
over 98 percent of the superintendents and 95 
percent of the teachers, provided checks on the 
reliability and completeness of the information 
provided. 

A log of science activities in their classrooms was kept by a sub-sample of thirty of the eighty-seven participating teachers. The teachers in this sub-sample were selected by random sampling methods (random numbers) and were assigned at random to three different time periods of ten days each. Information from these logs provided a check on information in the two questionnaires and furnished additional information on the science activities and teaching methods that were being used.

According to questionnaire and log replies (1954), emphasis on science in elementary classrooms was less than that given to social studies or reading, and more than that given to music or art; the typical science class was thirty minutes long, and the average time spent on science per week was under two hours. About sixty percent of the schools had some purchased science equipment. Although most of the teachers reported that equal emphasis was given to biological and physical science, their logs indicated that biological science topics were considered in a ratio of 3:1 to physical science topics. Text reading and discussion was the “most-used” science teaching method, while field trips and laboratory work were the “least-used.” The typical teacher had taken college courses in biology and health, had not taken courses in physical science or the teaching of science, and did not have a college degree.

Information on fifth-grade pupils taught by the 
eighty-seven participating teachers was secured 
by means of a science test (administered in De- 
cember 1953, and again in May 1954), an intelli- 
gence test, and a science reading test. One hun- 
dred percent returns were obtained. 

The 103-item science test, constructed by the 
writer, was designed to test knowledge of sci- 
ence facts, relationships, and generalizations. 







Subject matter included in the test was determined by (a) examining the science section of the Minnesota Curriculum Bulletin Number 7, A Guide for Teaching Science and Conservation,² and including some questions on each phase of science mentioned; (b) including additional biological science questions, because that section of the Curriculum Bulletin is limited; (c) omitting questions on health, conservation, and safety, because these topics have traditionally been included in social studies units and it was desired to find out how much science the pupils were learning; and (d) including some questions from each of the important areas of science, as determined from examination of children's textbooks, curriculum bulletins, textbooks on science teaching, and research recommendations: such areas as plants and animals, the various forms of energy, transportation and communication, simple chemistry, astronomy, earth science, simple machines, and size and time relationships.

Before its administration, the test was submitted for criticism to six expert teachers of science, and a trial form of the test was administered to two fifth-grade and two sixth-grade classes in order to detect confusing directions or questions and to estimate the time needed to administer the test. Teachers who administered the first tests in December did not know that the same test was to be repeated the following May, and all tests were returned to the writer after the first administration to the eighty-seven fifth-grade classes. Test results on both the pretest and retest were secured from all eighty-seven of the participating fifth grades. The reliability of the science test was calculated by the split-test method, obtaining the maximum likelihood estimate.³ The reliability of the pretest was found to be .91, and of the retest, .94.

Statistical techniques used in analyzing test 
and questionnaire results included the following: 


1. The “t” test was used to determine the significance of the differences between means on the science pretest and retest.⁴

2. Cluster sample statistical treatment was used in determining the sample estimate of the population mean and the variance of the mean on the science tests. In the testing program, the fifth-grade class to be tested was selected by random sampling methods. Thus each fifth-grade class in the population stratum had an equal chance of being selected. This fifth-grade class was, however, a cluster sample; each individual pupil in the whole population stratum of fifth-grade pupils in Minnesota did not have an equal chance of being selected. In calculating the estimated mean and variance for the population from this sample, therefore, cluster sample statistical analysis was used, in which each mean was weighted by the size of that particular fifth-grade class. The calculation formula for the population mean was:

$$\hat{\mu} = \frac{\sum_{i=1}^{m} n_i \bar{X}_i}{\sum_{i=1}^{m} n_i}$$

where $n_i \bar{X}_i$ is the number of individuals in the ith cluster multiplied by the mean of the ith cluster, and the estimated population mean is the sum of all of these quantities calculated for each of the m = 87 groups, divided by the total number of individuals in the sample.

Results of the above calculations showed the sample estimate of the population mean to be 59.54 for the science pretest and 64.67 for the retest, a mean gain of 5.13 points. When the “t” test of significance was applied to this gain, it was found to be significant at the .01 level.

3. The analysis of variance was used to compare science test scores received by pupils in three different I.Q. groups. The following results were found: (a) Of the total group of 1850 pupils, 147 were found to have I.Q.'s of 121 plus, 1153 to have I.Q.'s of 95-115, and 207 to have I.Q.'s below 90. Science test sub-score means on physical science and generalization questions were found to be significantly different (.01 level) for the three I.Q. groups. (b) Mean gains of the three I.Q. groups between the pretest and retest were not found to be significantly different.

4. The analysis of variance and analysis of variance and covariance statistical techniques were used to relate mean gains of fifth-grade classes between the science pretest and retest to each of the following teacher or teaching-situation factors: teachers' years in college; number of science areas in which the teacher took college courses; teachers' years of teaching experience; whether or not the teacher took a college course in science teaching methods; college physical science courses taken by the teacher; use of science instruments for observing and measuring; number of science books in the classroom other than textbooks; time and emphasis placed on science; and the relative use of each of the following in science teaching:


Visual aids 

Group planning of science projects 
Group performing of science projects 
Research reading 
Teacher-demonstration and explanation 
Individual laboratory work 


A statistically significant relationship between the teacher-factor being considered and class gains on the science tests was found in one case. The relationship between class gains and the teacher's number of years of teaching experience was found to be significant at the .05 level (analysis of variance and covariance). Information on this analysis will be found in Table I.

TABLE I
ANALYSIS OF VARIANCE AND COVARIANCE, CHANGE BETWEEN SCIENCE PRETEST-RETEST AND TEACHER'S TEACHING EXPERIENCE, WITH PRETEST CONSTANT
[Rows: Between Groups, Within Groups, Total. Columns: sums of squares and products, adjusted values, and the null hypothesis tested. Code: 1 = 1 year, 2 = 2-5 years, 3 = 6-10 years, 4 = 11-15 years, 5 = 16 or more years.]
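The computations named in items 1 and 2 above can be sketched in Python: the class-size-weighted estimate of the population mean, and a conventional one-sample t statistic for a mean gain. This is a minimal sketch with hypothetical numbers and hypothetical function names; the author's footnoted t computation, and any cluster-based standard error it may have used, are not reproduced here.

```python
import math
import statistics

def cluster_weighted_mean(class_sizes, class_means):
    """Estimate the population mean from whole-class clusters: multiply each
    class mean by its class size, sum, and divide by the total pupils."""
    if len(class_sizes) != len(class_means):
        raise ValueError("need exactly one size per class mean")
    weighted_sum = sum(n * xbar for n, xbar in zip(class_sizes, class_means))
    return weighted_sum / sum(class_sizes)

def t_for_mean_gain(gains):
    """One-sample t statistic for the hypothesis that the mean gain is zero:
    mean gain divided by its standard error, with n - 1 degrees of freedom."""
    n = len(gains)
    standard_error = statistics.stdev(gains) / math.sqrt(n)
    return statistics.mean(gains) / standard_error, n - 1

# Hypothetical numbers for illustration only; they are not survey data.
print(cluster_weighted_mean([20, 30], [50.0, 60.0]))  # 56.0
t, df = t_for_mean_gain([1, 2, 3, 4, 5])
print(f"t = {t:.3f}, df = {df}")  # t = 4.243, df = 4
```

An unweighted average of the two class means would give 55.0; the weighting is what lets the larger class count for more, as the formula for the population mean requires.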

The fact remains that most of the pupils did gain significantly on test scores in the five-month interval between the tests, but some fifth-grade groups gained more than others, some did not gain significantly, and some lost rather than gained. These differences in amounts of gain are therefore not due to the lapse of time and increasing maturity of the pupils alone; neither do they appear to be significantly related to pupil I.Q. or to any of the teacher or teaching-situation factors which were investigated, with the exception of the teacher's years of teaching experience.
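The comparisons in items 3 and 4, and the test summarized in Table I, rest on the analysis-of-variance F ratio. A minimal sketch of a one-way F ratio follows; the groups are tiny hypothetical score lists rather than the study's data, the function name is hypothetical, and the covariance adjustment for pretest scores used in Table I is not shown.

```python
import statistics

def one_way_anova_f(groups):
    """F ratio for a one-way analysis of variance:
    between-groups mean square divided by within-groups mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    k, n_total = len(groups), len(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Three tiny hypothetical I.Q. groups, not data from the study.
f, df1, df2 = one_way_anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f, df1, df2)  # 27.0 2 6
```

The resulting F is then referred to a table of the F distribution with (df1, df2) degrees of freedom at the chosen level, such as .05 or .01.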

Further analysis could be made to determine 
whether any combination of teacher-factors 
showed a significant relationship to pupil gains, 
or whether reading test results showed a signif- 
icant relationship to science test results. These 
analyses are not reported here. 

The results of the statistical analysis of questionnaire and test data may indicate a true lack of relationship between the teacher-factors checked and the pupils' science achievement, or they may reflect the fact that the general level of teacher preparation in science was limited, so that the differences between teachers and teaching practices were not marked enough to effect significant differences in class achievement.

The study reveals some information which 
has implications for the preparation of elemen- 
tary school teachers. 

In the group of 1850 fifth-grade pupils on whom complete test data were available, 147 were found to have I.Q.'s of over 120. It can thus be estimated that approximately 1732 Minnesota fifth-grade pupils (1954), exclusive of those in Minneapolis, St. Paul, and Duluth, had I.Q.'s of 121 plus. With the current need for capable adult scientists, and the knowledge that they are generally drawn from this high-I.Q. group, the indications from this study that many elementary school teachers are failing to (a) differentiate classroom work sufficiently so that the talented achieve in proportion to their abilities, and (b) sample adequately the various science areas and activities so that vocational and avocational interests are aroused early, have important implications. Certainly the tendency to emphasize biological science at the expense of physical science, the limited physical science background of the typical teacher in this survey, the predominant use of the textbook-discussion method in teaching science, and the almost complete absence of individual laboratory work and true experiments in the classrooms indicate that the potential scientists are not being provided for.

The objectives of elementary school science, 
moreover, do not apply only to those who intend 
to pursue scientific careers; they apply to all 
children. It is unlikely that Minnesota children 
in general are profiting any more from the pres- 
ent science program than are the high-ability 
children. Research is needed to determine the 
best ways of providing prospective and in-ser- 
vice teachers with the ‘‘know-how’’ and emotion- 
al push necessary to enable them to provide a 
dynamic science program for each one of their 
pupils, potential nuclear physicist and bird- 
watcher alike. 

This study made possible a detailed descrip- 
tion of the teachers, science teaching practices, 
and science teaching facilities in Minnesota fifth- 
grade classrooms in 1954. It made available a 
record of the I.Q.’s and science test scores of a 
representative sampling of Minnesota fifth-grade 
classes, from which it was possible to general- 
ize for the state as a whole. It provided informa- 
tion on the extent to which the objectives of ele- 
mentary school science, as measured by gains 
in a science achievement test, are being achieved 
in the elementary schools of Minnesota, the lim- 
itations in the present program, and suggestions 
for further research on the best methods of pre- 
paring teachers to further the achievement of the 
objectives of elementary school science with 
their pupils. 


FOOTNOTES 


1. Jane Johnston. The Relative Achievement of the Objectives of Elementary School Science in a Representative Sampling of Minnesota Schools, unpublished Ph.D. thesis, University of Minnesota, 1956.

2. A Guide for Instruction in Science and Conservation, Curriculum Bulletin Number 7 (St. Paul, Minnesota: Minnesota State Department of Education, 1951).

3. Palmer O. Johnson. Statistical Methods in Research (New York: Prentice-Hall, Inc., 1949), pp. 126-127.

4. Ibid., p. 78.

5. Palmer O. Johnson. Cluster Sampling, unpublished material, University of Minnesota.








CONTROLLED EXPERIMENTATION
IN THE CLASSROOM*


JULIAN C. STANLEY 
University of Wisconsin 


MY THESIS is that much more controlled experimentation under classroom conditions should be done. “Experimentation” as used here is not meant to include uncontrolled experimentation, important though such experiential analysis undoubtedly is. Nor do I include status studies unless they are imbedded in an experimentally controlled design.

There are two general types of controlled experimentation in classrooms. One of these is the “methods” study, in which two or more ways of doing something are compared in an unbiased fashion. The other is so-called “fundamental research,” whose intent is to derive general principles applicable beyond the immediate situation in which they are found. We need far more of both types of research.


The Speech Study 





Consider the following example of the many methodological decisions teachers must make virtually every day. Recently I had the pleasure of helping several high-school teachers set up an experiment to determine the effectiveness of observing seventh graders unobtrusively in the natural context of the classroom and then telling them in writing specifically which speech faults they needed to correct, and how to correct them. Logically and intuitively, it was apparent to the speech teachers that this procedure should be quite effective, far better than just instructing students in the speech class or having no formal instruction at all. They were scientifically minded, however, and therefore decided to subject their hunch to empirical test. The design of their study should serve to illuminate several aspects of careful experimentation.

They might have taken the students in each speech class at the beginning of the semester and divided them randomly into two groups, one of which would be “followed” in other classes during the semester and notified of speech imperfections. The other group would not be “followed.” If the extra attention was effective in improving speech behavior, then the “followed” group should be the better speakers and oral readers by the end of the semester. Having both groups, in intermixed order, speak and read orally for judges who do not know to which group a given individual belongs is a means of securing unbiased ratings that can then be separated for the two groups and compared. If the average rating of the “followed” group is significantly larger than the average rating of the “not followed” group, then we would conclude that the extra attention is having some positive influence upon speaking. Of course, the influence still might not be great enough to warrant the expenditure of time and money, but at least “following” would be shown to be having some effect.

In this sort of experiment it is easy to test additional hypotheses without using any more students than are necessary for the simple comparison of the “followed” and “not followed” groups. One might, for instance, hypothesize that “following” would work better for girls than for boys. By dividing the students first by sex and then randomly into followed and not-followed groups (see Table I), we can answer two new questions: (1) Do boys differ from girls on the ratings? (2) Are there differences between the sexes with respect to the value of “following”?
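The two new questions correspond to a main effect and an interaction in the 2 × 2 layout of Table I. As a minimal sketch with hypothetical cell means (no ratings from the speech study are reported here) and a hypothetical function name, the effects can be computed as simple contrasts; judging whether they are significant would still require an analysis of variance that uses the variation within cells.

```python
def two_by_two_effects(means):
    """Main effects and interaction from the four cell means of a 2 x 2
    factorial layout; means[(sex, method)] holds each cell mean."""
    boys_gain = means[("boys", "followed")] - means[("boys", "not followed")]
    girls_gain = means[("girls", "followed")] - means[("girls", "not followed")]
    method_effect = (boys_gain + girls_gain) / 2
    sex_effect = (
        means[("girls", "not followed")] + means[("girls", "followed")]
        - means[("boys", "not followed")] - means[("boys", "followed")]
    ) / 2
    # Interaction: does following help girls more (or less) than boys?
    interaction = girls_gain - boys_gain
    return {"method": method_effect, "sex": sex_effect, "sex x method": interaction}

hypothetical = {
    ("boys", "not followed"): 10.0, ("boys", "followed"): 12.0,
    ("girls", "not followed"): 11.0, ("girls", "followed"): 15.0,
}
print(two_by_two_effects(hypothetical))
# {'method': 3.0, 'sex': 2.0, 'sex x method': 2.0}
```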

Going still further, we might include in the experiment both those persons who are taking speech and those taking English, dividing all the speech students into two sex groups and subdividing each of these into the “followed” vs. the “non-followed.” Similarly, we would separate the English students into boys and girls, then divide the boys into “followed” vs. “non-followed” and the girls likewise. See Table II.
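The random assignment described in the note to Table II (each boy, and likewise each girl, assigned at random to one of the four subject-method groups) can be sketched as a shuffle followed by a round-robin deal. This is a minimal sketch: the function name and student names are hypothetical, and 24 students per sex is chosen only because it yields the six per group reported for the speech study.

```python
import itertools
import random

# The four subject-method cells within each sex (Table II).
CELLS = list(itertools.product(("speech", "English"), ("not followed", "followed")))

def assign_within_sex(students, seed=None):
    """Randomly assign one sex's students to the four subject-method cells:
    shuffle the list, then deal it out round-robin so that cell sizes
    differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(students)
    rng.shuffle(shuffled)
    cells = {cell: [] for cell in CELLS}
    for position, student in enumerate(shuffled):
        cells[CELLS[position % len(CELLS)]].append(student)
    return cells

# 24 placeholder boys and 24 placeholder girls give six per cell.
boys = [f"boy {k}" for k in range(1, 25)]
girls = [f"girl {k}" for k in range(1, 25)]
design = {"boys": assign_within_sex(boys, seed=1), "girls": assign_within_sex(girls, seed=2)}
for sex, cells in design.items():
    print(sex, {f"{subject}/{method}": len(members)
                for (subject, method), members in cells.items()})
```

Randomizing separately within each sex keeps the sexes balanced across the four cells, which is what allows sex to be analyzed as a factor alongside subject and method.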

This arrangement results in eight groups: boys taking speech who are followed, boys taking speech who are not followed, girls taking speech who are followed, girls taking speech who are not followed, boys taking English who are followed, boys taking English who are not followed, girls taking English who are followed, and girls taking English who are not followed. It enables us to answer seven different questions, contrasted with only one for the two-group design and three for the four sex-method groups, and yet requires no more students than the two-group setup. The seven questions that can be answered are: Do boys differ from girls with respect to speaking or oral reading ability? Do those persons taking speech differ from those taking English? Does the “followed” group differ from the “non-followed” group? Are there interactions of sex, subject, and method; sex and subject; sex and method; or subject and method?¹

*Revised version of a paper read to the Educational Research Section of the Wisconsin Education Association, November, 195.

The four- and eight-group arrangements shown in Tables I and II are both “factorial designs.” We have two levels of each factor (sex, subject, and method) combined in all possible ways with each other to yield eight different sex-subject-method groups in Table II. The huge advantages of factorial designs are that they allow several variables to be manipulated simultaneously and then to be evaluated independently of each other, and that they make the testing of interactions possible. Frequently heard objections to the artificiality of controlled experimentation involving the manipulation of a single variable in a sort of vacuum, with other factors held as constant as possible, are not applicable to the factorial design. It permits us to approximate a natural setting in our experimentation, while at the same time testing several possible factors with nearly as much efficiency as if we were considering only a single variable such as “followed” vs. “non-followed.” As yet, this design is seldom employed in educational research, though it is of wide utility.
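The count of testable questions follows a simple rule: a full factorial design over k factors supports one test for every non-empty subset of the factors, 2^k - 1 in all. A short Python sketch (the function name is hypothetical) that lists these effects reproduces the one, three, and seven questions mentioned in the text.

```python
from itertools import combinations

def factorial_effects(factors):
    """All main effects and interactions that a full factorial design over
    `factors` allows one to test: every non-empty subset of the factors."""
    effects = []
    for size in range(1, len(factors) + 1):
        for combo in combinations(factors, size):
            effects.append(" x ".join(combo))
    return effects

for factors in (["method"], ["sex", "method"], ["sex", "subject", "method"]):
    effects = factorial_effects(factors)
    print(len(effects), effects)
```

The count depends only on the number of factors, not on how many levels each has. That is why the role-sex-initial opinion design of the opinion-change study, with its four-level role factor, also yields seven hypotheses.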

In Table II we have the outline of a methods experiment actually conducted. Results were all negative. The boys and girls spoke and read equally well, the speech classes did not do appreciably better than the English classes, the “followed” students were no better speakers or readers at the end of the semester than the “non-followed” group, and none of the four interactions was significant. Therefore, within the limitations of our experimental procedure, we are forced to conclude, at least tentatively, that “following” students is a waste of time and money.

It is essential to remember the conditions under which we experimented, however. Having used only seventh graders in a certain rather atypical type of school, we cannot generalize our conclusions to other grades or other schools. The method of “following” we used is the only one for which we have information; another method might be effective. Our findings are limited by the nature of the speech and oral reading ratings secured, and by the competencies and idiosyncrasies of the 10 particular raters involved. Furthermore, the rating situation itself may be too formal and unlifelike to elicit natural speech behavior from the students. Finally, we gathered our data at the end of the first semester and consequently cannot make statements applicable to the whole year. Nevertheless, there is certainly no positive evidence in this carefully controlled study that being followed is worthwhile. Henceforth, the burden of proof is upon proponents of “following” rather than upon its opponents. They may want to launch new experimentation to test further hypotheses, such as the following: another well-defined method of following is effective; following works better in other grades than it does in the seventh; following is more effective in other schools than here; other measures of speaking and oral reading ability reveal the effectiveness of following.

All footnotes will be found at end of article.

Where possible, we try to design methods experiments so as to answer questions more general than the purely local problem under consideration. Relating the study to a conceptual scheme or theory from which results can be deduced in advance enables us to test important aspects of the theory while we are getting our specific information. How well this can be done depends largely upon how advanced conceptually the field in which we are working is. If speech theorists have devised definite frameworks to unify their field, then we can test deductions based upon these. For example, if one theory emphasizes changing the individual's self-concept in a certain way, we can use a method of “following” that is congruent with it. Or, if another theorist emphasizes giving the student successful experiences in a speaking situation, we can try to accomplish that. Because there are numerous ways to assist or tutor a student, we must select the ones for trial on the best possible logical and empirical grounds, rather than just picking at random.


TABLE I

A FOUR-GROUP (SEX-METHOD) FACTORIAL DESIGN*

                     Method
Sex        Not Followed     Followed
Boys            n               n
Girls           n               n

*There are n students in each of the four groups, or 4n students in the entire experiment. Half of the 2n boys are assigned randomly to the “not followed” group and the other half to the “followed” group. Similarly, half of the 2n girls are assigned randomly to the “not followed” group and the other half to the “followed” group.


TABLE II

AN EIGHT-GROUP (SEX-SUBJECT-METHOD) FACTORIAL DESIGN*

                               Method
Sex        Subject     Not Followed     Followed
Boys       Speech
           English
Girls      Speech
           English

*Each boy is assigned randomly to one of four groups (speech-not followed, speech-followed, English-not followed, or English-followed). Likewise, each girl is assigned randomly to one of the four groups. Preferably, each of the eight groups will contain the same number of students; in the speech study there were six students in each of the eight groups.


The Opinion-Change Study


The methods experiment of Table II constituted action research, since it was designed by actual teachers of speech to determine whether or not they should continue following their students. Its results were of immediate practical value to these teachers in their school. Some research is less related to action and more to general speculation. A study conducted by Herbert Klausmeier and myself² may illustrate this semi-theoretical type of experiment.

We were interested in whether or not role playing changes extreme opinions. Our subjects were 145 beginning graduate students, 62 in my advanced educational psychology class and 83 in his. Thirteen days before the beginning of the experiment itself we obtained ratings of attitude toward world government on a four-question opinionnaire. Then for my experiment I chose low scorers (“isolationists”) and high scorers (“internationalists”), leaving the middle scorers to rate secretly the quality of the role playing. One-fourth of the extreme scorers were assigned to play a role opposite to their initial opinions, one-fourth to play a role congruent with their initial opinions, one-fourth to play no role but to serve as first-row observers, and one-fourth to go to the library before the experiment began and work on unrelated material. Thus there were four “role” groups: different role, same role, observer, and outside control. Each of the four role groups contained an equal number of men and women, and in each, half of the persons were initially “isolationist” and half were initially “internationalist.” In all, there were 16 role-sex-initial opinion groups, as shown in Table III. Seven hypotheses can be tested with this design, using change-of-opinion scores based upon differences between the first opinionnaire and a similar opinionnaire administered the day after the experiment:


Are the differences among the four roles significant? Do the changes of men differ from those of women? Does it make any difference with respect to opinion change whether the person was originally ‘‘isolationist’’ or ‘‘internationalist’’? Are any of the interactions among sex, initial opinion, and role significant?

Klausmeier’s design was identical with mine, except that where I used a ‘‘sex’’ classification he used ‘‘age’’: 31 or less vs. 32 or more. Under our conditions, role playing had no effect upon opinion. We had hypothesized that persons playing a role in line with their initial opinions should change little, while persons playing an opposite role should shift in the direction of that point of view, but this did not happen. Nor did the persons playing a compatible role change significantly less than the persons who left the room without even observing the role playing. Our only positive finding was a slight tendency for the women in my study to get more world-minded and the men less world-minded; this interaction of sex with initial opinions may be attributable merely to chance.


Implications 


Since both of the studies I have outlined yielded essentially negative findings, you may wonder whether these experiments were worth performing. Obviously, it is valuable to know that following students to observe and record their speech errors for their information is not effective. It is probably of even greater importance to learn that confident claims for the opinion-modifying influence of role playing do not hold up in a controlled experiment.³ Both experiments highlight the fact that hypotheses must be tested empirically as well as logically. Our best brain-children may be worthless or downright pernicious. Expert opinions, pooled judgments, brilliant intuitions, and shrewd hunches are frequently misleading. Ultimately, they must be tested by the careful gathering of evaluative data if education is to advance on the basis of sound principles.

We badly need educational experimentation of the controlled variety. Too much of our research effort has been directed toward questionnaires, statistical compilations, and dead-end correlational studies. We have shied away from designing and executing studies involving the manipulation of pertinent variables. Our colleagues in agriculture and in behavioral sciences such as psychology have forged far ahead of us experimentally. Yet we cannot rely upon psychologists, sociologists, and anthropologists to do our experimentation. They provide much relevant theory and many enticing hypotheses for us, but we must hoe our own garden; they cannot be expected to do this.


Obstacles to Controlled Experimentation 





In my opinion, one of the major reasons why experimentation in education has languished is that our graduate schools have been confused by a false practitioner-versus-research distinction. In most schools of education, this has led to the training of few, if any, persons in educational experimentation as such. Whereas almost no psychology major can earn a Ph.D. degree in a reputable university without prolonged exposure to experimental psychology, statistics, and measurement, we let the great majority of our doctoral candidates get through their dissertations with hardly the basic rudiments of training for experimentation. And training is essential. While it is true that many persons who receive such training will actually engage in little experimentation, untrained individuals are virtually certain to do none or to do it ineptly.

Another reason for the paucity of educational 
experimentation is that university professors in 
schools of education typically do very little them- 
selves. Until they provide the training for exper- 
imentation and set appropriate examples for their 
students, improvement is unlikely. 

Few public school administrators and even fewer parents realize the dire necessity for continual experimentation. Some equate experimentation with vivisection. Others protest against the use of control groups, saying that if a given method is likely to be of value they want all the students to have it. They do not realize that in this way they make it difficult, if not impossible, ever to assess the worth of the method. Perhaps the widely publicized tests of Salk’s polio vaccine will be a salutary antidote to this attitude. It is doubtful, though, that many persons not already aware of the need for controls in experimentation will perceive the similarity between the polio study and, say, the introduction of a new method for teaching handwriting.





TABLE III

A FACTORIALLY DESIGNED CHANGE-OF-OPINION EXPERIMENT INVOLVING 16 GROUPS*

                                                       Role
                              Congruent with     Opposite to        First-Row    Outside
Sex      Initial Opinions     Initial Opinions   Initial Opinions   Observer     Control
Men      Internationalist
         Isolationist
Women    Internationalist
         Isolationist

*Two persons were in each of the 16 groups, a total of 32 in the experiment, not counting the 30 raters. Each man was assigned randomly to one of the four role groups, women were assigned ...







Two essential ingredients of controlled experimentation are difficult to incorporate into our thinking. One is the imperative need for designing experiments ‘‘from scratch.’’ Experiments do not ‘‘just grow.’’ They must be planned in advance down to the finest detail. The final analysis of data must be anticipated and worked through abstractly or in a pilot study before the experiment itself begins. This preplanning phase will usually require more time than the actual experiment. Once all aspects of the design have been thought through and refined as much as possible, the experiment must be run off exactly as planned, allowing only for unavoidable disasters. We hear a great deal about ‘‘flexible experimentation,’’ but once set in motion a controlled experiment is completely rigid and inflexible. Flexibility must be reserved for the preplanning and postplanning phases. The experimental findings almost invariably suggest new hypotheses that can be tested experimentally.

The second difficult concept to put over is the crucial need for randomization. The very notion of randomization is hard to grasp, since it implies chaos for the single case and a high degree of order for the group. Suppose that you have 30 pupils in a class and that you want a control group of 15 and an experimental group of the same size. How shall you divide the class into halves in order not to give an undeserved lead to one group or the other? If there were five seats to a row, you might consider taking all students in the first three rows for one group and all in the last three rows for the other group. Would this procedure yield two groups of equal ability for, say, an arithmetic experiment? Not in most classes, for seating is hardly likely to be random. If mostly girls sit near the front of the room and mostly boys near the rear, and if one sex surpasses the other in arithmetic skill, the results of the experiment will be biased from the very beginning.

We might put all 30 names into a hat, each on a separate slip, shuffle them thoroughly, and then draw out 15. Or, somewhat less defensibly, we might alphabetize the 30 names and put the odd-numbered persons (1, 3, 5, ..., 29) in one group and the even-numbered persons (2, 4, 6, ..., 30) in the other. The best method is to number the children from 1 through 30 (or from 0 through 29) and draw 15 numbers within this range from a table of random numbers. In all three instances (shuffling slips, alphabetizing, and random numbers), it is essential to decide in advance which will be the control group and which the experimental. We cannot wait until we note into which group our prize pupil happens to fall and then call this the experimental group. Experiments must be without bias if they are to yield unambiguous conclusions.
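The random-numbers method can be sketched as follows. This is a minimal sketch that assumes Python's `random` module stands in for a printed table of random numbers. The function name `random_halves`, the seed, and the pupil names are hypothetical. The rule that the drawn numbers form the experimental group is fixed in the code before any draw. This is the advance decision the text insists on.

```python
import random

def random_halves(pupils, seed=None):
    """Split a class into experimental and control halves by random numbers.

    The rule "drawn numbers form the experimental group" is fixed here,
    before any number is drawn.
    """
    rng = random.Random(seed)  # stands in for a printed table of random numbers
    numbers = list(range(1, len(pupils) + 1))  # number the pupils 1..N
    drawn = set(rng.sample(numbers, len(pupils) // 2))
    experimental = [p for k, p in zip(numbers, pupils) if k in drawn]
    control = [p for k, p in zip(numbers, pupils) if k not in drawn]
    return experimental, control

# Hypothetical roster of 30 pupils
pupils = [f"pupil_{k:02d}" for k in range(1, 31)]
experimental, control = random_halves(pupils, seed=1957)
print(len(experimental), len(control))  # 15 15
```

Fixing the seed only makes the illustration reproducible. In practice, the draw is made once and recorded.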

I cannot stress too heavily the need for getting fresh data specifically designed to answer the questions you pose. Too often a researcher, particularly a thesis-driven graduate student, is tempted to exhume some moribund material he finds stored away in a convenient mausoleum and perform a statistical autopsy on it. This ‘‘seeing what the data died of’’ is good clinical medicine but rarely worthwhile research. The persons who originally collected the data did not have the present problem in mind, so it would be quite unusual for the data to bear directly on the current hypotheses. About all you can get out of this post-mortem analysis is a crazy patchwork quilt of guarded conclusions and tortuous qualifying statements. The customary upshot is to say that ‘‘more research is needed.’’ Of course it is, but not research of this foredoomed variety.

Almost as culpable is the person who gathers mountains of data as he goes along, without any particular experimental design, in the hope that eventually it will serve to test some hypotheses he gets around to formulating. In many instances this is a colossal waste of time and energy, harking back to the exploratory days of the late nineteenth century and the biometrical approach of Karl Pearson. Since 1925, and particularly since 1935, when R. A. Fisher’s The Design of Experiments appeared, we have known about the inefficiency of this ‘‘sawed-off shotgun’’ approach. No longer do we need huge numbers of experimental subjects in order to reach valid conclusions. Small-sample theory enables us to emphasize the rigor of the experimental design instead of the size of the sample itself. Ten thousand subjects in an ill-planned study may yield results more equivocal than 40 subjects in a carefully designed experiment.





Preparation for Classroom Experimentation 





There are many experiments you can do in the
classroom. Most of these probably fall within 
the category of action-method studies, but some 
may be of more general import. Remember that 
the generality of your findings is inextricably 
bound up with the formulation of the problem; if 
you strive for generality of application rather 
than viewing your investigation as answering a 
merely local problem, you can make a contribu- 
tion beyond the confines of your school. 

You will want to have the necessary training in statistics (particularly the analysis of variance) and research methods before trying to design an experiment, and/or you will want to work closely from the beginning with a specialist who understands these matters well. Unfortunately, there are not many such persons available, perhaps no more than 100 educationalists in the whole United States. A possible solution is to take statistics and experimental design courses in the psychology department of a major university. Collaboration is the desirable plan, however, even though you do acquire considerable technical sophistication.⁴ Cooperative efforts such as the speech study I described earlier benefit all persons involved immensely.

In any event, it is essential that you design your experiment thoroughly before beginning to ‘‘turn the crank’’ and that all collaboration be started at the very beginning, rather than after the data are in. As the statistician must nearly always say to the researcher who consults him too late, after the experiment has been badly mismanaged, ‘‘Of all sad words of tongue or pen, the saddest are these: it might have been.’’ Ordinarily, the specialist in experimental design does not know your subject-matter field as well as you do, but he can point out logical flaws and methodological imperfections that might nullify all your otherwise commendable efforts.


Recapitulation 





This paper has been concerned entirely with controlled experimentation in the classroom, which I believe we have neglected to our great detriment. Most decisions about methods have been based upon colloquial, anecdotal, or administrative considerations rather than experimentation. Seldom are adequate control groups incorporated into classroom experiments. The necessity for long-range experimental design is not usually appreciated by teachers and administrators. The principle of randomness is often misunderstood or ignored in favor of elaborate matching, which has several disadvantages. Worst of all, few teachers, including those with doctoral degrees, get even minimal training for modern experimentation. Our professional literature is virtually devoid of well-controlled experimental studies in the classroom. We continue to pool ignorance via conferences, questionnaires, rating scales, opinionnaires, and ineffective correlational studies. All of these are valuable for certain purposes but not sufficient in themselves.

If we are to advance beyond the dark ages of educational pre-science, we must emulate the experimental proficiency and zeal of colleagues in other behavioral sciences. Ours is a distinctive task; education has its own problems to solve. Only by close collaboration between experimentally aware persons at all levels in a widespread research partnership can we hope to make headway against the vast complexities of the teaching-learning process.







FOOTNOTES 


1. If there are no significant interactions, the ‘‘main effects’’ are additive. Suppose, for example, that girls are superior to boys, persons taking speech are superior to those taking English, and the ‘‘followed’’ group is superior to the ‘‘non-followed’’ group. Then, if interactions are nil, the best speakers will be the girls taking speech and being followed, and the worst speakers will be the boys taking English and not being followed. Suppose instead that, despite the general superiority of girls, speech, and following, the boys taking English and not being followed were the best speakers. We would then have a triple interaction (sex, subject, and method).

Two-factor interactions may also occur. Girls taking English might be superior to girls taking speech, despite the overall superiority of the speech group over the English group. Also, boys being ‘‘followed’’ might be better speakers than the followed girls, despite the overall speaking superiority of the girls. Or those persons taking English and being followed might be better speakers, regardless of sex, than those taking speech and being followed, even though for the experiment as a whole the speech group is superior to the English group.


2. ‘‘Opinion Constancy After Formal Role Playing,’’ Journal of Social Psychology (in press).


3. As in the speech study, one must not overgeneralize the results of the Stanley-Klausmeier investigations. Definite limitations were set by the type and small number of subjects used, the formal nature of the role-playing situation, and the opinion area involved. Janis and King, studying ‘‘The Influence of Role Playing on Opinion Change’’ [Journal of Abnormal and Social Psychology, XLIX (1954), pp. 211-18] with small groups under non-classroom conditions, found evidence for such change. For a recent review article that contains 22 references, see J. H. Mann, ‘‘Experimental Evaluations of Role Playing,’’ Psychological Bulletin, LIII (1956), pp. 227-34.


4. Even the experimental-design ‘‘experts’’ in education and psychology consult each other and the mathematical statisticians frequently.








RECENT TECHNIQUES FOR ANALYZING ASSOCIATION IN CONTINGENCY TABLES AS APPLIED TO AN ANALYTICAL FOLLOW-UP SURVEY OF EDUCATION GRADUATES*


SAMUEL T. MAYO 
Loyola University 
Chicago, Illinois 


Introduction 


THE PURPOSE of the present paper is to describe some recent techniques for analyzing association in contingency tables. It also illustrates their application to an important educational problem: evaluating the human product relative to the occupational destination of graduates of a teacher-training institution.

The kinds of statistical techniques used in the present investigation are those applicable to enumerative data in the form of contingency tables, which are classified on the basis of attributes. The concept of attributes will be considered elsewhere in this manuscript. The use of contingency tables to analyze categorical data is not new, dating back some forty-odd years to the work of Karl Pearson. However, little fundamental work on the improvement of the theory and applications of contingency tables seems to have been done until just after World War II. Since that time a number of significant techniques have been developed by statistical workers in the fields of biometrics and agriculture, and these promise to greatly extend the usefulness of the contingency table. It should be recognized that there is no reason, in principle, why such techniques should not find great usefulness in the behavioral sciences as well.

Chi-square, as usually employed, has traditionally been limited to contingency tables classified on the basis of only two attributes at a time and consisting of a ‘‘large’’ sample. Among the resulting disadvantages of this more familiar application of contingency tables are these:


1. The chi-square test for a contingency table suffers from the weakness that it covers all forms of departure from expectation and is correspondingly insensitive to departures of a specified type.

2. Two-way tables, analyzed piecemeal, ig- 
nore higher-order interactions, which involve 
three or more attributes simultaneously. 

3. ‘‘Large’’ samples are not always available 
in behavioral research. 





Some recently developed techniques consid- 
ered here serve to overcome these weaknesses. 
Among these techniques are: 


1. Approximate tests of significance for contingency tables in which association is known to exist, so that the significance of departures of a specified type (e.g., correlation or regression) may be tested;

2. Approximate tests of significance for complex contingency tables involving various hypotheses among as many as four attributes simultaneously (i.e., tests of the significance of higher-order interactions);

3. Exact tests of significance for ‘‘small’’ sample data, or contingency tables with ‘‘small’’ or zero theoretical cell frequencies and classified on the basis of two or three attributes.


The first type of technique, by which one may investigate departures of a particular type, uses scoring schemes for the rows and columns of the contingency tables. Such techniques have been described by Cochran (2), Stuart (11), Yates (13), and Williams (12). The techniques of Yates and of Williams will be described and illustrated in the present paper.

In the second type of technique, for complex tables in which one may test hypotheses involving three or more attributes simultaneously, chi-square may be estimated by either of two approximate solutions. In one solution, as given by Bartlett (1), Norton (9), and Snedecor (10), chi-square as a test for higher-order interaction in a 2 x 2 x 2 table or the general 2ⁿ form can be approximated by solving a higher-order equation. Thus, in the 2 x 2 x 2 case, one would solve a cubic equation. In the other type of solution, chi-square for an r x s x t x u (or four-way) contingency table with any number of categories per attribute can be estimated by application of the likelihood ratio criterion as described by Mood (8) and Doi (3).

*The author wishes to acknowledge the help of his advisor, Dr. Palmer O. Johnson, in carrying out the investigation upon which this paper is based.

In the third type of technique, comprising exact tests, one considers the exact relative frequency of a selected configuration of cell frequencies among all configurations consistent with the observed marginal totals, which are used as ancillary statistics. The concept was first introduced by Fisher (4) for the 2 x 2 case and has been extended to more complex cases by Freeman and Halton (5).
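For the 2 x 2 case, the enumeration just described can be sketched in a few lines. This minimal illustration is not the author's computation. It uses Python's `math.comb` with exact `Fraction` arithmetic, and it adopts one common two-sided convention: it sums the probabilities of all configurations no more probable than the observed one. The function names are hypothetical, and the Freeman and Halton extension to larger tables is not shown.

```python
from fractions import Fraction
from math import comb

def table_probability(a, row1, col1, n):
    """Exact probability of the 2 x 2 table whose top-left cell is a,
    given row-1 total row1, column-1 total col1, and grand total n."""
    return Fraction(comb(row1, a) * comb(n - row1, col1 - a), comb(n, col1))

def fisher_exact_2x2(a, b, c, d):
    """Two-sided exact significance level for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    low, high = max(0, col1 - (n - row1)), min(row1, col1)
    observed = table_probability(a, row1, col1, n)
    # Every configuration consistent with the fixed marginal totals
    probs = [table_probability(k, row1, col1, n) for k in range(low, high + 1)]
    # Sum the configurations no more probable than the observed one
    return sum(p for p in probs if p <= observed)

print(fisher_exact_2x2(3, 1, 1, 3))  # 17/35
```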

These techniques will now be described in de- 
tail and illustrated elsewhere in the manuscript 
by the survey data. 


Scoring Techniques in Simple Contingency Tables








The significance of association between two 
attributes in a simple (or two-way) contingency 
table can be tested by the usual chi-square test. 
However, chi-square is an overall test of all 
forms of departures from expectation. Under 
the null hypothesis that the two attributes are 
not associated, the expected frequencies in any 
row (or column) are proportional to the margin- 
al frequencies. When one rejects the null hy- 
pothesis of independence and association is as- 
sumed to exist between the attributes, it is of 
practical importance to be able to specify the 
nature of the association. Williams (12) main- 
tains that the interpretation of the data is simp- 
lest if the association can be accounted for in 
terms of the correlation between a single pair 
of variates corresponding to the two attributes. 
In order to make this interpretation, one must 
make the assumption that both attributes are 
based on characters which are either directly 
quantitative or are in the form of gradings which 
can be regarded as having an underlying quantitative basis. Williams shows that certain significance tests can be applied to contingency tables as asymptotically exact tests. These are the tests developed for discriminant analysis and for the interpretation of interactions, which are exact when the variates involved are normally distributed.

The methods of Yates (13) and Williams (12) differ in the manner of determining the scores for rows and columns. In Yates’ method the scores are chosen arbitrarily; Yates has pointed out that any system of scoring may be assigned to each of the classifications. In Williams’ method, on the other hand, scores are derived empirically from the data and must satisfy certain identities. The empirically derived scores maximize the correlation coefficient obtainable from a given set of data, so they can be brought into closer concordance with the data.


The Yates’ Method of Scoring Simple Contingency Tables. The symbolic notation of Yates was modified in order to be more consistent with that of Williams as well as with the usual statistical symbols. The notations used here are as follows:


$n_{ij}$   observed frequency in the ith row and jth column

$n_{i.}$   total frequency in the ith row

$n_{.j}$   total frequency in the jth column

$n_{..}$   total number of observations in the table

$x_i$   arbitrary score assigned to the ith row

$y_j$   arbitrary score assigned to the jth column

$B$   numerator of the formula for the regression coefficients of $y_j$ on $x_i$ and of $x_i$ on $y_j$

$A$   denominator of the regression formula for $y_j$ on $x_i$

$A'$   denominator of the regression formula for $x_i$ on $y_j$

$b_{xy}$   regression coefficient of $y_j$ on $x_i$

$b_{yx}$   regression coefficient of $x_i$ on $y_j$

$r$   number of rows in the contingency table

$s$   number of columns in the contingency table

$t_\alpha$   tabled value of t at a particular significance level α


The Yates’ symbolism for a 3 x 3 contingency 
table with assumed arbitrary scores being the 
series -1, 0, and +1 for both rows and for col- 
umns is given in Figure 1. 

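The quantities A, A', and B are calculated by the following expressions:

$$B = n_{..}\sum_i\sum_j x_i y_j n_{ij} - \Big(\sum_i x_i n_{i.}\Big)\Big(\sum_j y_j n_{.j}\Big)$$

$$A = n_{..}\sum_i x_i^2 n_{i.} - \Big(\sum_i x_i n_{i.}\Big)^2$$

$$A' = n_{..}\sum_j y_j^2 n_{.j} - \Big(\sum_j y_j n_{.j}\Big)^2$$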


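The two regression coefficients are given by

$$b_{xy} = B/A \qquad\text{and}\qquad b_{yx} = B/A'$$

To test whether each regression coefficient is significantly different from zero, we first compute the standard error of each regression coefficient and then apply the t test. The standard errors of the coefficients are given by

$$\text{S.E. of } b_{xy} = \sqrt{A'/(A\,n_{..})} \qquad\text{and}\qquad \text{S.E. of } b_{yx} = \sqrt{A/(A'\,n_{..})}$$

and the corresponding t ratios are given by

$$t = \frac{b_{xy}}{\text{S.E. of } b_{xy}} \qquad\text{and}\qquad t = \frac{b_{yx}}{\text{S.E. of } b_{yx}}$$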







The t’s are not differentiated by subscripts, 
since they will be numerically equal when ap- 
plied to a given set of data. The obtained t’s 
are referred to either the t table with infinite de- 
grees of freedom or to the normal probability 
tables. 

After we have estimated the magnitude of the effect of one variate upon the other in the parent population, we may further inquire as to the errors of this estimate. Fiducial limits for the total variation of $y_j$ accounted for by $x_i$ between the extremes of both variates may be estimated as follows. First multiply b and its standard error by (r - 1)/(s - 1), then compute

$$b \pm t_\alpha\,(\text{S.E. of } b)$$


An alternative test of significance of the association of row and column scores is given as a chi-square with 1 degree of freedom by

$$\chi^2 = B^2 n_{..}/(AA')$$

However, if the t's for testing the significance of the regression coefficients have already been calculated, we can obtain this same value of chi-square by merely squaring t, since $t^2 = (B/A)^2 \cdot A\,n_{..}/A' = B^2 n_{..}/(AA')$.
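The Yates computations above can be collected into a short routine. This is a sketch under stated assumptions. The 3 x 3 frequencies are invented for illustration and are not the survey data, and the function name `yates_score_test` is hypothetical. The p-value uses the normal tail `math.erfc(sqrt(chi2 / 2))`, which is equivalent to referring t to the normal table as the text prescribes. The names `b_xy` and `b_yx` follow this paper's notation, in which $b_{xy} = B/A$.

```python
import math

def yates_score_test(table, x, y):
    """Regression test for a contingency table with arbitrary row scores x
    and column scores y, using the B, A, A' quantities of the text."""
    rows, cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]                                   # n_i.
    col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]  # n_.j
    n = sum(row_tot)                                                        # n_..
    sx = sum(x[i] * row_tot[i] for i in range(rows))
    sy = sum(y[j] * col_tot[j] for j in range(cols))
    sxy = sum(x[i] * y[j] * table[i][j] for i in range(rows) for j in range(cols))
    B = n * sxy - sx * sy
    A = n * sum(x[i] ** 2 * row_tot[i] for i in range(rows)) - sx ** 2
    A_prime = n * sum(y[j] ** 2 * col_tot[j] for j in range(cols)) - sy ** 2
    b_xy, b_yx = B / A, B / A_prime        # y on x, and x on y (paper's notation)
    se_xy = math.sqrt(A_prime / (A * n))
    se_yx = math.sqrt(A / (A_prime * n))
    t = b_xy / se_xy                       # numerically equal to b_yx / se_yx
    chi2 = B ** 2 * n / (A * A_prime)      # equals t ** 2, 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))     # two-sided normal tail for t
    return {"B": B, "A": A, "A_prime": A_prime, "b_xy": b_xy, "b_yx": b_yx,
            "se_xy": se_xy, "se_yx": se_yx, "t": t, "chi2": chi2, "p": p}

# Invented 3 x 3 table, scored -1, 0, +1 as in Figure 1
table = [[10, 5, 2], [6, 8, 6], [3, 5, 11]]
result = yates_score_test(table, x=[-1, 0, 1], y=[-1, 0, 1])
print(round(result["chi2"], 2))  # 10.5
```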

The Williams’ Method of Scoring Simple Con- 
tingency Tables—The symbolic notations for the 
Williams method are the same as for the Yates’ 
method, with the following exceptions: 





$x_i$   denotes the empirically derived score for the ith row

$y_j$   denotes the empirically derived score for the jth column

$r_{xy}$   denotes the correlation between the row scores and column scores


The matrix notations will be explained in the 
text as they appear. Further explanation of ma- 
trix symbolism and algebraic rules may be found 
in texts by Gulliksen (6:160), Mood (8:170-176) 
and by other authors. 

As in Yates’ method, the first step in Williams’ method is to make an overall test of association by the usual χ² test. This is done to ensure that association does, in fact, exist, since both the Yates and the Williams methods are primarily aids to the interpretation of a contingency table in which association is known to exist.

Solution of the empirical scores proceeds by 
a number of operations with matrices. 

The original r x s contingency table with elements $n_{ij}$ is transformed to the matrix G with elements given by

$$g_{ij} = \frac{n_{ij}}{\sqrt{n_{i.}\,n_{.j}}}$$

This amounts to dividing each cell frequency in the contingency table by the geometric mean of the row total and column total for that particular cell.

The matrix G is transformed to matrix T by 
the matrix multiplication 


T = GG' 


This formula says that matrix T is obtained by multiplying G by its transpose. The transpose of a matrix is obtained by making its rows into columns and its columns into rows. Each element in the T matrix is obtained by following the rules for multiplying two matrices. Mood (8) has stated the definition for matrix multiplication in general as follows:


The element in the ith row and jth column of the product matrix is obtained by multiplying the elements of the ith row of the left-hand matrix by the corresponding elements of the jth column of the right-hand matrix and adding the results.


In order to calculate the scores for rows and columns of the contingency table that maximize the correlation between the two scores, it is necessary to solve for the second largest latent root of the matrix T. We shall designate this root by the symbol λ. This root is equal to the square of the maximized correlation coefficient, and the latent vector of this root gives the required set of scores for the rows. Column scores are calculated from row scores by an appropriate formula, which will be given subsequently.

To solve for λ we obtain the characteristic equation of matrix T in three steps. First, subtract the unknown quantity λ from each of the three terms in the main diagonal of T. Then treat the resulting matrix as a determinant. Finally, expand the determinant to yield the characteristic equation.
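A worked form of this expansion is added here as a supplement. It assumes a 3 x 3 table, so that T is 3 x 3, as the three diagonal terms imply. $M_2$ denotes the sum of the three principal 2 x 2 minors of T. One fact not stated in the text shortens the work: λ = 1 is always a root of $T = GG'$, with latent vector proportional to $\sqrt{n_{i.}}$. The remaining two roots therefore satisfy $\lambda_2 + \lambda_3 = \operatorname{tr} T - 1$ and $\lambda_2\lambda_3 = \det T$, and the required root is the larger root of a quadratic:

$$\det(T - \lambda I) = -\lambda^3 + (\operatorname{tr} T)\,\lambda^2 - M_2\,\lambda + \det T = 0$$

$$\lambda = \frac{(\operatorname{tr} T - 1) + \sqrt{(\operatorname{tr} T - 1)^2 - 4\det T}}{2}$$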

When the numerical value of λ has been determined to the desired degree of accuracy, the row scores are found by obtaining a set of solutions $b_i$ to the set of simultaneous equations

$$(T - \lambda I)\{b_i\} = 0$$

and finding a set of scores

$$x_i = k\,b_i$$

proportional to the $b_i$'s so that the restrictions

$$\sum_i x_i n_{i.} = 0 \qquad\text{and}\qquad \sum_i x_i^2 n_{i.} = n_{..}$$

are satisfied.
The column scores are calculated by the relation

$$y_j = \frac{\sum_i n_{ij} x_i}{n_{.j}\, r_{xy}}$$

and the obtained scores must satisfy the restrictions

$$\sum_j y_j n_{.j} = 0 \qquad\text{and}\qquad \sum_j y_j^2 n_{.j} = n_{..}$$

A check on the accuracy of the scores is given by the formula for the correlation in terms of row and column scores,

$$r_{xy} = \frac{1}{n_{..}}\sum_i\sum_j n_{ij}\, x_i\, y_j$$

which should be numerically equal to $\sqrt{\lambda}$.

Williams showed that the formula given by Yates for testing the significance of the association of row and column scores,

$$\chi^2 = B^2 n_{..}/(AA'),$$

is equivalent to testing

$$\chi^2 = n_{..}\, r_{xy}^2$$

as a chi-square with one degree of freedom. Since $r_{xy}$ is also a correlation coefficient, an alternative test is to treat

$$F = \frac{(n_{..} - 2)\, r_{xy}^2}{1 - r_{xy}^2}$$

as an F with 1 and $n_{..} - 2$ degrees of freedom.
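The Williams procedure can be sketched numerically as follows. Several choices here belong to this sketch rather than to the author's procedure. First, it finds the second latent root by power iteration instead of expanding the characteristic equation. It does this after removing the trivial root 1, whose latent vector is proportional to $\sqrt{n_{i.}}$. Second, because $T = GG'$ is in symmetric form, it converts the latent vector into row scores by dividing each component by $\sqrt{n_{i.}}$ before rescaling; this is what makes $\sum_i x_i n_{i.} = 0$ hold. Third, it reuses an invented 3 x 3 table. The function name `williams_scores` and the iteration count are hypothetical.

```python
import math

def williams_scores(table, iterations=500):
    """Row/column scores maximizing the correlation (power-iteration sketch)."""
    r, s = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(s)]
    n = sum(row_tot)
    # G: each cell divided by the geometric mean of its row and column totals
    G = [[table[i][j] / math.sqrt(row_tot[i] * col_tot[j]) for j in range(s)]
         for i in range(r)]
    T = [[sum(G[i][j] * G[k][j] for j in range(s)) for k in range(r)]
         for i in range(r)]  # T = GG'
    # Remove the trivial root 1 (latent vector sqrt(n_i.)); the largest
    # remaining root is then the second largest root of T
    v = [math.sqrt(t) for t in row_tot]
    T2 = [[T[i][k] - v[i] * v[k] / n for k in range(r)] for i in range(r)]
    u = [float(i + 1) for i in range(r)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(T2[i][k] * u[k] for k in range(r)) for i in range(r)]
        norm = math.sqrt(sum(val * val for val in w))
        u = [val / norm for val in w]
    lam = sum(u[i] * T2[i][k] * u[k] for i in range(r) for k in range(r))
    rho = math.sqrt(lam)
    # Row scores: divide by sqrt(n_i.), then scale so that sum x^2 n_i. = n..
    x = [u[i] / math.sqrt(row_tot[i]) for i in range(r)]
    scale = math.sqrt(n / sum(x[i] ** 2 * row_tot[i] for i in range(r)))
    x = [scale * val for val in x]
    # Column scores from row scores
    y = [sum(table[i][j] * x[i] for i in range(r)) / (col_tot[j] * rho)
         for j in range(s)]
    r_xy = sum(table[i][j] * x[i] * y[j] for i in range(r) for j in range(s)) / n
    chi2 = n * r_xy ** 2
    F = (n - 2) * r_xy ** 2 / (1 - r_xy ** 2)
    return {"lambda": lam, "x": x, "y": y, "r_xy": r_xy, "chi2": chi2, "F": F}

table = [[10, 5, 2], [6, 8, 6], [3, 5, 11]]  # invented frequencies
result = williams_scores(table)
```

Because Williams' scores maximize the correlation, the resulting $r_{xy}$ cannot fall below the correlation that fixed scores such as -1, 0, +1 give for the same table. The sketch assumes association exists (λ > 0); with λ = 0, the column-score formula would divide by zero.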


Higher-Order Interactions in Complex Contingency Tables








Whereas in a two-way contingency table we test the hypothesis that there is no association between the two attributes of classification (equivalently, that there is no first-order interaction between them), in a complex table we may test the hypothesis that there is no higher-order interaction among the several attributes. Several different kinds of higher-order interactions may exist, and later these will be considered when the likelihood ratio criterion is applied. First, however, the 2 x 2 x 2 case will be considered.

Second-Order Interaction in the 2 x 2 x 2 Contingency Table. The method of testing second-order interaction in the 2 x 2 x 2 contingency table was suggested by Bartlett (1), developed by Norton (9), and illustrated by Snedecor (10). The symbolism of these authors is followed in the present paper.

Consider, first, the two fourfold tables with cell contents a, b, c, d and a', b', c', d', respectively, as shown in Figure 2, where the notations are as follows:

A, B, C   denote the three dichotomized attributes

i, j, k   denote the numbers of the categories of the three attributes, respectively, which take on values of 1 or 2

a, b, c, d   denote the cell contents of the first 2 x 2 table of attribute C

a', b', c', d'   denote the cell contents of the second 2 x 2 table of attribute C

$p_a, p_b, p_c, p_d$   denote the parameter values of the probabilities of a case from sample N falling in the indicated cell

$p_{a'}, p_{b'}, p_{c'}, p_{d'}$   denote the parameter values of the probabilities of a case from sample N' falling in the indicated cell


Each of the 2 x 2 tables, which are classified by the two attributes A and B, may be examined for independence separately by the usual chi-square test. In these two tables the null hypotheses tested are

$$p_a p_d = p_b p_c \qquad\text{and}\qquad p_{a'} p_{d'} = p_{b'} p_{c'}$$

respectively.
In the theory being followed, the sample estimate of first-order interaction is the ratio

$$\frac{a/b}{c/d} = \frac{a/c}{b/d} = \frac{ad}{bc}.$$

Aside from the continuity adjustment and from errors of sampling, if the null hypothesis is true ($\chi^2 = 0$), then $ad/bc = 1$. In this sense the hypothesis tested by chi-square is that in the population the interaction ratio is unity.

In the 2 x 2 x 2 table, the interaction in one 2 x 2 table may differ from that in the other. The ratio of these first-order interactions is known as the second-order interaction. The sample estimate of the second-order interaction is

$$\frac{ad/bc}{a'd'/b'c'} = \frac{ad\,b'c'}{a'd'\,bc}.$$
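Both ratios come straight from the cell counts. The following minimal Python sketch uses hypothetical counts, and the function names are illustrative assumptions. `Fraction` keeps the ratios exact. A zero count in any denominator makes a ratio undefined, and Python then raises `ZeroDivisionError`.

```python
from fractions import Fraction


def first_order_interaction(a, b, c, d):
    """Sample interaction ratio ad/bc of one 2 x 2 table."""
    return Fraction(a * d, b * c)


def second_order_interaction(first, second):
    """(ad/bc) / (a'd'/b'c') for a pair of 2 x 2 tables given as (a, b, c, d)."""
    return first_order_interaction(*first) / first_order_interaction(*second)


table_1 = (10, 5, 4, 8)     # hypothetical a, b, c, d
table_2 = (20, 10, 8, 16)   # hypothetical a', b', c', d'
print(first_order_interaction(*table_1))           # 4
print(second_order_interaction(table_1, table_2))  # 1
```

Here both tables have the same first-order interaction, so the estimated second-order interaction is exactly 1.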





[Figure 1. Symbolism for Yates' method with an example of numerical scores]

[Figure 2. Symbolism for a 2 x 2 x 2 contingency table]


Before proceeding to the test of significance for second-order interaction, it is worth explaining the concept of deviations in 2 x 2 tables. In a 2 x 2 table the deviation (the difference between observed and theoretical frequency) has the same absolute magnitude in all four cells. Let this deviation be symbolized by

$$x = f_o - f_t,$$

where $f_o$ denotes the observed cell frequency and $f_t$ the theoretical or expected cell frequency. The algebraic sign of x is positive for one pair of diagonal cells and negative for the other pair, so that

$$\sum_{i=1}^{2}\sum_{j=1}^{2} x_{ij} = 0.$$

It is easy to construct a table of expected numbers in which the interaction ratio is 1, as follows:

$$\begin{array}{cc} a - x & b + x \\ c + x & d - x \end{array}$$

The quantity x for a given 2 x 2 table could be computed by solving the equation

$$\frac{(a - x)(d - x)}{(b + x)(c + x)} = 1,$$

which may also be expressed in the form

$$(a - x)(d - x) = (b + x)(c + x).$$
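In the 2 x 2 case this equation is easy to solve. When both members are expanded, the $x^2$ terms cancel, which leaves $x = (ad - bc)/n$ with $n = a + b + c + d$. Each expected cell is then the familiar product of row total and column total divided by n. The following small Python sketch uses hypothetical counts and exact fractions. The function name is an illustrative assumption.

```python
from fractions import Fraction


def deviation_2x2(a, b, c, d):
    """Common deviation x = f_o - f_t for cells a and d of a 2 x 2 table.

    Expanding (a - x)(d - x) = (b + x)(c + x) cancels the x**2 terms,
    leaving the linear solution x = (ad - bc) / n.
    """
    n = a + b + c + d
    return Fraction(a * d - b * c, n)


a, b, c, d = 10, 5, 4, 12          # hypothetical counts
x = deviation_2x2(a, b, c, d)
expected = [[a - x, b + x], [c + x, d - x]]
print(x)                            # 100/31
# The expected table has interaction ratio exactly 1:
print(expected[0][0] * expected[1][1] / (expected[0][1] * expected[1][0]))  # 1
```

As a check, $a - x = 210/31 = (15 \times 14)/31$, which is the row total times the column total divided by n.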


By analogy it may be shown that there is a deviation x common to all eight cells of a pair of 2 x 2 tables considered jointly as a 2 x 2 x 2 table. The analogous equation to solve is

$$(a - x)(d - x)(c' - x)(b' - x) = (c + x)(b + x)(a' + x)(d' + x),$$

which is a cubic in x because the $x^4$ terms on the two sides cancel. A rule of thumb makes this equation easy to remember. In the left-hand member the letters follow a diagonal downward in the first 2 x 2 table, then go across and upward in the second. Methods for approximating the real root of this cubic equation may be found in any college algebra book. Essentially, they consist of trying various values of x until two are found between which the difference between the left-hand and right-hand members changes sign.

When x has been calculated to a sufficient number of decimal places, the adjusted chi-square with one degree of freedom for second-order interaction is given by the formula

$$\chi^2 = \left(|x| - \tfrac{1}{2}\right)^2 \left(\frac{1}{a - x} + \frac{1}{d - x} + \frac{1}{c' - x} + \frac{1}{b' - x} + \frac{1}{c + x} + \frac{1}{b + x} + \frac{1}{a' + x} + \frac{1}{d' + x}\right).$$
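The trial-and-error search for the root, followed by the adjusted chi-square, can be sketched in Python as a bisection search. This is an illustration under stated assumptions, not Norton's or Snedecor's worked procedure. All eight counts are assumed positive, and `a2`, `b2`, `c2`, `d2` stand for a', b', c', d'. Flooring the adjusted value at zero when |x| < 1/2 is also an assumption of this sketch. The search starts from the interval on which every expected frequency stays positive. On that interval the left-hand member only decreases and the right-hand member only increases, so exactly one sign change occurs.

```python
def cubic_gap(x, cells):
    """Left-hand member minus right-hand member of
    (a - x)(d - x)(c' - x)(b' - x) = (c + x)(b + x)(a' + x)(d' + x)."""
    a, b, c, d, a2, b2, c2, d2 = cells      # a2..d2 stand for a', b', c', d'
    left = (a - x) * (d - x) * (c2 - x) * (b2 - x)
    right = (c + x) * (b + x) * (a2 + x) * (d2 + x)
    return left - right


def solve_deviation(cells, iterations=100):
    """Bisection for the root that keeps all eight expected frequencies positive."""
    a, b, c, d, a2, b2, c2, d2 = cells
    low, high = -min(b, c, a2, d2), min(a, d, b2, c2)   # gap > 0 at low, < 0 at high
    for _ in range(iterations):
        mid = (low + high) / 2
        if cubic_gap(mid, cells) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def second_order_chi_square(cells, x):
    """(|x| - 1/2)^2 times the sum of reciprocals of the eight expected frequencies."""
    a, b, c, d, a2, b2, c2, d2 = cells
    expected = [a - x, d - x, c2 - x, b2 - x, c + x, b + x, a2 + x, d2 + x]
    adjusted = max(abs(x) - 0.5, 0.0)       # assumption: floor at 0 when |x| < 1/2
    return adjusted ** 2 * sum(1.0 / m for m in expected)


cells = (10, 5, 4, 8, 20, 10, 8, 16)        # hypothetical counts
x = solve_deviation(cells)
print(x, second_order_chi_square(cells, x))
```

In these hypothetical counts both tables have ad/bc = 4, so the root is x = 0 (up to rounding) and the statistic is 0.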


If the chi-square for second-order interaction is 
not significant, the pair of 2 x 2 tables may be 
combined into a single 2 x 2 table and tested for 
first-order interaction. 

Likelihood Ratio Criterion for a Four-Way Contingency Table—The foregoing section considered the 2 x 2 x 2 table, a special case of the three-way contingency table. A general means of testing several kinds of hypotheses for a four-way table by the likelihood ratio criterion will now be considered. The method is also applicable to three-way tables.

Consider a four-way contingency table classified by the attributes A, B, C, and D as shown symbolically in Figure 3, where the notations are given as follows:

A, B, C, D denote the four attributes

Categories of attributes are denoted by subscripts as follows:

A: (i = 1, 2, ..., r)
B: (j = 1, 2, ..., s)
C: (k = 1, 2, ..., t)
D: (h = 1, 2, ..., u)

$n_{ijkh}$ denotes the frequency in the cell common to the ith row, jth column, kth category of attribute C, and hth category of attribute D

Marginal totals are denoted by replacing the summed index by a dot, e.g.,

$$n_{ijk\cdot} = \sum_h n_{ijkh}, \qquad n_{ij\cdot\cdot} = \sum_k \sum_h n_{ijkh}, \qquad n = n_{\cdot\cdot\cdot\cdot} = \sum_h \sum_k \sum_j \sum_i n_{ijkh}.$$

Probabilities for particular category values are expressed by appropriate subscripts, e.g.,

$p_{ijkh}$ denotes the probability of an individual's being in the ijkh-th cell
$p_i, p_j, p_k, p_h$ denote the probabilities of an individual's being in the categories i, j, k, and h, respectively
$\lambda$ denotes the likelihood ratio

[Figure 3. Symbolism for a four-way contingency table]

Six kinds of null hypotheses can be tested in the four-way contingency table by the likelihood ratio. They may be stated verbally as follows:


1. The four attributes are mutually independent.
2. Three of the attributes are mutually independent.
3. Two of the attributes are independent.
4. One attribute is independent of a classification formed by combining the other three attributes into a unit.
5. One attribute is independent of a classification formed by combining two of the other attributes into a unit.
6. The combination of two attributes is independent of a combination of the other two.


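Each of the six kinds of null hypotheses will be considered separately. Each will also be illustrated by an example of the hypothesis stated in terms of probabilities and an example of the calculation of the chi-square that tests it by an approximate method. When stating one of the hypotheses in terms of probabilities, one assumes that a case's being in a category of one attribute is independent of its being in a category of another. Then, by a well-known theorem of probability for compound events as shown by Mood (8:34), the null hypothesis may be stated as a proposition that the compound probability (i.e., the probability of a case's being jointly in a certain pattern of categories) is equal to the product of the probabilities for the separate categories.

As an intermediate step in estimating chi-square for the various hypotheses, the present method calculates the likelihood ratio $\lambda$. It can be shown for complex contingency tables that in large samples the quantity $-2\log_e \lambda$ is approximately distributed as chi-square with the appropriate degrees of freedom. All of the terms necessary to calculate $\lambda$ for any hypothesis are given by the following eighteen expressions. In them the factor 2.30258 ($= \log_e 10$) converts common logarithms to natural logarithms, and $n = n_{\cdot\cdot\cdot\cdot}$:

$$
\begin{aligned}
(1)\quad & \log_e \textstyle\prod_i n_{i\cdot\cdot\cdot}^{\,n_{i\cdot\cdot\cdot}} = 2.30258 \sum_i n_{i\cdot\cdot\cdot}\log_{10} n_{i\cdot\cdot\cdot}\\
(2)\quad & \log_e \textstyle\prod_j n_{\cdot j\cdot\cdot}^{\,n_{\cdot j\cdot\cdot}} = 2.30258 \sum_j n_{\cdot j\cdot\cdot}\log_{10} n_{\cdot j\cdot\cdot}\\
(3)\quad & \log_e \textstyle\prod_k n_{\cdot\cdot k\cdot}^{\,n_{\cdot\cdot k\cdot}} = 2.30258 \sum_k n_{\cdot\cdot k\cdot}\log_{10} n_{\cdot\cdot k\cdot}\\
(4)\quad & \log_e \textstyle\prod_h n_{\cdot\cdot\cdot h}^{\,n_{\cdot\cdot\cdot h}} = 2.30258 \sum_h n_{\cdot\cdot\cdot h}\log_{10} n_{\cdot\cdot\cdot h}\\
(5)\quad & \log_e \textstyle\prod_{ijkh} n_{ijkh}^{\,n_{ijkh}} = 2.30258 \sum_{ijkh} n_{ijkh}\log_{10} n_{ijkh}\\
(6)\quad & \log_e \textstyle\prod_{jkh} n_{\cdot jkh}^{\,n_{\cdot jkh}} = 2.30258 \sum_{jkh} n_{\cdot jkh}\log_{10} n_{\cdot jkh}\\
(7)\quad & \log_e \textstyle\prod_{ikh} n_{i\cdot kh}^{\,n_{i\cdot kh}} = 2.30258 \sum_{ikh} n_{i\cdot kh}\log_{10} n_{i\cdot kh}\\
(8)\quad & \log_e \textstyle\prod_{ijh} n_{ij\cdot h}^{\,n_{ij\cdot h}} = 2.30258 \sum_{ijh} n_{ij\cdot h}\log_{10} n_{ij\cdot h}\\
(9)\quad & \log_e \textstyle\prod_{ijk} n_{ijk\cdot}^{\,n_{ijk\cdot}} = 2.30258 \sum_{ijk} n_{ijk\cdot}\log_{10} n_{ijk\cdot}\\
(10)\quad & \log_e \textstyle\prod_{ij} n_{ij\cdot\cdot}^{\,n_{ij\cdot\cdot}} = 2.30258 \sum_{ij} n_{ij\cdot\cdot}\log_{10} n_{ij\cdot\cdot}\\
(11)\quad & \log_e \textstyle\prod_{ik} n_{i\cdot k\cdot}^{\,n_{i\cdot k\cdot}} = 2.30258 \sum_{ik} n_{i\cdot k\cdot}\log_{10} n_{i\cdot k\cdot}\\
(12)\quad & \log_e \textstyle\prod_{ih} n_{i\cdot\cdot h}^{\,n_{i\cdot\cdot h}} = 2.30258 \sum_{ih} n_{i\cdot\cdot h}\log_{10} n_{i\cdot\cdot h}\\
(13)\quad & \log_e \textstyle\prod_{jk} n_{\cdot jk\cdot}^{\,n_{\cdot jk\cdot}} = 2.30258 \sum_{jk} n_{\cdot jk\cdot}\log_{10} n_{\cdot jk\cdot}\\
(14)\quad & \log_e \textstyle\prod_{jh} n_{\cdot j\cdot h}^{\,n_{\cdot j\cdot h}} = 2.30258 \sum_{jh} n_{\cdot j\cdot h}\log_{10} n_{\cdot j\cdot h}\\
(15)\quad & \log_e \textstyle\prod_{kh} n_{\cdot\cdot kh}^{\,n_{\cdot\cdot kh}} = 2.30258 \sum_{kh} n_{\cdot\cdot kh}\log_{10} n_{\cdot\cdot kh}\\
(16)\quad & \log_e n^{n} = 2.30258\,(n\log_{10} n)\\
(17)\quad & \log_e n^{2n} = 2.30258\,(2n\log_{10} n)\\
(18)\quad & \log_e n^{3n} = 2.30258\,(3n\log_{10} n)
\end{aligned}
$$

Descriptions of the algebra necessary to test each of the six kinds of hypotheses, along with appropriate examples, are presented below.

a. Are the four attributes mutually independent?

The null hypothesis tested here is

$$H: p_{ijkh} = p_i\,p_j\,p_k\,p_h.$$

The likelihood ratio is

$$\lambda = \frac{\prod_i n_{i\cdot\cdot\cdot}^{\,n_{i\cdot\cdot\cdot}}\,\prod_j n_{\cdot j\cdot\cdot}^{\,n_{\cdot j\cdot\cdot}}\,\prod_k n_{\cdot\cdot k\cdot}^{\,n_{\cdot\cdot k\cdot}}\,\prod_h n_{\cdot\cdot\cdot h}^{\,n_{\cdot\cdot\cdot h}}}{n^{3n}\,\prod_{ijkh} n_{ijkh}^{\,n_{ijkh}}},$$

with df = rstu - (r + s + t + u) + 3 and

$$\log_e \lambda = (1) + (2) + (3) + (4) - (18) - (5).$$

Chi-square for testing the above hypothesis is calculated by $\chi^2 = -2\log_e \lambda$. The same procedure for calculating each chi-square from the appropriate $\lambda$ was followed throughout.

b. Are three of the attributes mutually independent?

Four hypotheses may be tested here, namely those testing the mutual independence of A, B, and C; A, B, and D; B, C, and D; or A, C, and D. As an example, the hypothesis for testing the mutual independence of A, B, and C is

$$H: p_{ijk} = p_i\,p_j\,p_k.$$

The associated contingency table may be thought of as a three-way table with cell frequencies $n_{ijk\cdot}$. The likelihood ratio is

$$\lambda = \frac{\prod_i n_{i\cdot\cdot\cdot}^{\,n_{i\cdot\cdot\cdot}}\,\prod_j n_{\cdot j\cdot\cdot}^{\,n_{\cdot j\cdot\cdot}}\,\prod_k n_{\cdot\cdot k\cdot}^{\,n_{\cdot\cdot k\cdot}}}{n^{2n}\,\prod_{ijk} n_{ijk\cdot}^{\,n_{ijk\cdot}}},$$

with df = rst - (r + s + t) + 2 and

$$\log_e \lambda = (1) + (2) + (3) - (17) - (9).$$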
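The eighteen expressions combine mechanically, as the following Python sketch shows for hypotheses a and b. It assumes a table of counts stored as a dictionary keyed by (i, j, k, h) and takes natural logarithms directly, so the 2.30258 conversion factor is not needed. The function names and data layout are illustrative assumptions, not the author's computing procedure.

```python
import math
from collections import Counter


def sum_m_log_m(table, axes):
    """Sum of m * log_e(m) over the marginal totals m that keep the attribute
    positions in `axes` (0 = A, 1 = B, 2 = C, 3 = D).  For example, axes=(0,)
    gives expression (1) and axes=(0, 1, 2) gives expression (9).
    Empty margins contribute nothing (the usual convention 0 log 0 = 0)."""
    margins = Counter()
    for cell, count in table.items():
        margins[tuple(cell[a] for a in axes)] += count
    return sum(m * math.log(m) for m in margins.values() if m > 0)


def chi_square_mutual_independence(table, axes):
    """-2 log_e(lambda) for mutual independence of the attributes in `axes`.

    axes=(0, 1, 2, 3): hypothesis a, log lambda = (1)+(2)+(3)+(4) - (18) - (5).
    axes=(0, 1, 2):    hypothesis b, log lambda = (1)+(2)+(3) - (17) - (9).
    """
    n = sum(table.values())
    log_lam = sum(sum_m_log_m(table, (a,)) for a in axes)
    log_lam -= (len(axes) - 1) * n * math.log(n)   # (18) for four, (17) for three
    log_lam -= sum_m_log_m(table, tuple(axes))     # (5) or (9)
    return -2.0 * log_lam


def df_mutual_independence(levels):
    """levels = category counts, e.g. (r, s, t, u) -> rstu - (r+s+t+u) + 3."""
    return math.prod(levels) - sum(levels) + len(levels) - 1


table = {(i, j, k, h): (i + 1) * (j + 1) * (k + 1) * (h + 1)
         for i in range(2) for j in range(2) for k in range(2) for h in range(2)}
print(chi_square_mutual_independence(table, (0, 1, 2, 3)))  # close to 0: counts factorize
print(df_mutual_independence((2, 2, 2, 2)))                 # 11
```

The hypothetical table is built as a product of one factor per attribute, so mutual independence holds exactly and the statistic vanishes apart from rounding.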


c. Are two of the attributes independent?

Six hypotheses may be tested here, namely, those testing independence of A and B, A and C, A and D, B and C, B and D, and C and D. As an example, the hypothesis for testing the independence of A and B is

$$H: p_{ij} = p_i\,p_j.$$

The likelihood ratio is

$$\lambda = \frac{\prod_i n_{i\cdot\cdot\cdot}^{\,n_{i\cdot\cdot\cdot}}\,\prod_j n_{\cdot j\cdot\cdot}^{\,n_{\cdot j\cdot\cdot}}}{n^{n}\,\prod_{ij} n_{ij\cdot\cdot}^{\,n_{ij\cdot\cdot}}},$$

with df = (r - 1)(s - 1) and

$$\log_e \lambda = (1) + (2) - (16) - (10).$$
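For two attributes, $-2\log_e \lambda$ is the familiar statistic $2\sum O \ln(O/E)$ computed on the collapsed table $n_{ij\cdot\cdot}$. The following Python sketch checks the equivalence on hypothetical counts. The function names and the dictionary layout are assumptions made for illustration.

```python
import math
from collections import Counter


def xlogx(m):
    return m * math.log(m) if m > 0 else 0.0


def chi_square_two_attributes(table, axis_a, axis_b):
    """-2 log_e(lambda) for H: p_ij = p_i p_j, using
    log lambda = (1) + (2) - (16) - (10) written with natural logs."""
    n = sum(table.values())
    row, col, joint = Counter(), Counter(), Counter()
    for cell, count in table.items():
        row[cell[axis_a]] += count
        col[cell[axis_b]] += count
        joint[cell[axis_a], cell[axis_b]] += count
    log_lam = (sum(map(xlogx, row.values())) + sum(map(xlogx, col.values()))
               - xlogx(n) - sum(map(xlogx, joint.values())))
    return -2.0 * log_lam


def g_statistic(counts):
    """Textbook form 2 * sum O ln(O / E) for a two-way list of lists."""
    n = sum(map(sum, counts))
    rows = [sum(r) for r in counts]
    cols = [sum(c) for c in zip(*counts)]
    return 2.0 * sum(o * math.log(o * n / (rows[i] * cols[j]))
                     for i, r in enumerate(counts) for j, o in enumerate(r) if o > 0)


counts = [[12, 5], [7, 16]]                       # hypothetical counts
table = {(i, j): counts[i][j] for i in range(2) for j in range(2)}
print(chi_square_two_attributes(table, 0, 1), g_statistic(counts))
```

The two printed values agree up to rounding, since expanding $2\sum O\ln(On/(RC))$ gives exactly $-2[(1) + (2) - (16) - (10)]$ in natural logarithms.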


d. Is one attribute independent of a classification formed by combining the other three attributes into a unit?

Four hypotheses may be tested here, namely, those testing the independence of A from (B, C, D); B from (A, C, D); C from (A, B, D); and D from (A, B, C). As an example, the null hypothesis for the independence of A from (B, C, D) is

$$H: p_{ijkh} = p_i\,p_{jkh},$$

where $p_{ijkh}$ is the probability of an individual's being in the ijkh-th cell, $p_i$ the probability of being in the ith category of attribute A, and $p_{jkh}$ the probability of being jointly in the jth, kth, and hth categories of attributes B, C, and D.

The likelihood ratio is

$$\lambda = \frac{\prod_i n_{i\cdot\cdot\cdot}^{\,n_{i\cdot\cdot\cdot}}\,\prod_{jkh} n_{\cdot jkh}^{\,n_{\cdot jkh}}}{n^{n}\,\prod_{ijkh} n_{ijkh}^{\,n_{ijkh}}},$$

with df = (r - 1)(stu - 1) and

$$\log_e \lambda = (1) + (6) - (16) - (5).$$


e. Is one attribute independent of a classification formed by combining two of the other attributes into a unit?

Twelve hypotheses may be tested here, namely, those testing whether A is independent of B and C, B and D, or C and D; whether B is independent of A and C, A and D, or C and D; whether C is independent of A and B, A and D, or B and D; and whether D is independent of A and B, A and C, or B and C. As an example, the hypothesis for testing the independence of A from B and C is

$$H: p_{ijk} = p_i\,p_{jk},$$

where $p_{ijk}$ is the probability of being in the cells formed by summing over the h's and $p_{jk}$ is the probability of being jointly in the jth and kth categories of attributes B and C.

The likelihood ratio is

$$\lambda = \frac{\prod_i n_{i\cdot\cdot\cdot}^{\,n_{i\cdot\cdot\cdot}}\,\prod_{jk} n_{\cdot jk\cdot}^{\,n_{\cdot jk\cdot}}}{n^{n}\,\prod_{ijk} n_{ijk\cdot}^{\,n_{ijk\cdot}}},$$

with df = (r - 1)(st - 1) and

$$\log_e \lambda = (1) + (13) - (16) - (9).$$


f. Is a classification formed by the combination of two attributes independent of a classification formed by the combination of the other two attributes?

Three hypotheses may be tested here, namely, those testing whether the combination of A and B is independent of that of C and D; whether the combination of A and C is independent of that of B and D; and whether the combination of A and D is independent of that of B and C. As an example, the hypothesis for testing the independence of (A, B) from (C, D) is

$$H: p_{ijkh} = p_{ij}\,p_{kh},$$

where $p_{ij}$ is the probability of being in the ith and jth categories of attributes A and B jointly, and $p_{kh}$ the probability of being in the kth and hth categories of attributes C and D jointly.

The likelihood ratio is

$$\lambda = \frac{\prod_{ij} n_{ij\cdot\cdot}^{\,n_{ij\cdot\cdot}}\,\prod_{kh} n_{\cdot\cdot kh}^{\,n_{\cdot\cdot kh}}}{n^{n}\,\prod_{ijkh} n_{ijkh}^{\,n_{ijkh}}},$$

with df = (rs - 1)(tu - 1) and

$$\log_e \lambda = (10) + (15) - (16) - (5).$$
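Hypotheses d, e, and f share one pattern: one group of attributes is tested against another group. The following Python sketch therefore covers all three with a single function. As before, the dictionary layout keyed by (i, j, k, h) and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def sum_m_log_m(table, axes):
    """Sum of m * log_e(m) over marginal totals keeping the positions in `axes`."""
    margins = Counter()
    for cell, count in table.items():
        margins[tuple(cell[a] for a in axes)] += count
    return sum(m * math.log(m) for m in margins.values() if m > 0)


def chi_square_between_groups(table, group_x, group_y):
    """-2 log_e(lambda) for independence of the classification formed by the
    attributes in group_x from the one formed by those in group_y.

    d: (0,), (1, 2, 3) -> (1) + (6) - (16) - (5)
    e: (0,), (1, 2)    -> (1) + (13) - (16) - (9)
    f: (0, 1), (2, 3)  -> (10) + (15) - (16) - (5)
    """
    n = sum(table.values())
    log_lam = (sum_m_log_m(table, group_x) + sum_m_log_m(table, group_y)
               - n * math.log(n) - sum_m_log_m(table, group_x + group_y))
    return -2.0 * log_lam


def df_between_groups(levels, group_x, group_y):
    """(product of levels in group_x - 1) * (product of levels in group_y - 1)."""
    cx = math.prod(levels[a] for a in group_x)
    cy = math.prod(levels[a] for a in group_y)
    return (cx - 1) * (cy - 1)


u = {(0, 0): 5, (0, 1): 1, (1, 0): 2, (1, 1): 4}   # hypothetical (A, B) pattern
v = {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 2}   # hypothetical (C, D) pattern
table = {(i, j, k, h): u[i, j] * v[k, h] for (i, j) in u for (k, h) in v}
print(chi_square_between_groups(table, (0, 1), (2, 3)))   # close to 0
print(chi_square_between_groups(table, (0,), (1,)))       # clearly positive
```

In this hypothetical table (A, B) is exactly independent of (C, D), so hypothesis f gives a statistic near 0. A and B remain associated with each other, however, so the test of A against B alone is large.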


The Exact Method of Freeman and Halton 





The method of Freeman and Halton (5) extends Fisher's exact method for the 2 x 2 case to tables with any number of attributes and any number of associated categories. It applies to any sampled array of numbers, whatever the population from which the sample is drawn, provided that (a) either the parent population is infinite or the sampling is done with replacement of the sampled members, and (b) the sampling is random.

Fisher's method is the special case of the present method for the 2 x 2 table. Like it, Freeman and Halton's method is applicable to tables whose theoretical frequencies are so small that the chi-square test is not appropriate, and it can even handle tables with some zero frequencies. According to the authors, the only difficulty in using the method is the amount of computational labor required for large samples.

The method in general consists of assuming the border totals fixed, considering only relationships internal to the contingency table, considering every possible array of cell frequencies with the given border totals, and applying a test of significance as follows: (a) all arrays subject to the same general conditions as the observed array (i.e., the same border totals) are listed; (b) the corresponding a priori probabilities are calculated by means of the appropriate probability expression; (c) the a priori probabilities smaller than or equal to the probability of the observed array are noted, these being the probabilities of all arrays which are a priori as probable as, or less probable than, the observed array; (d) all probabilities satisfying the condition in (c) are summed to yield the probability of obtaining an array as probable as, or less probable than, the observed array.
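Steps (a) to (d) can be sketched in Python for a two-way table. The sketch enumerates every r x c array with the observed border totals, evaluates the two-way Freeman and Halton probability expression for each, and sums those no larger than the probability of the observed array. The function names are assumptions made for illustration. The small relative tolerance, which treats equally probable arrays as ties despite floating-point rounding, is also a design choice of this sketch. Brute-force enumeration grows quickly with n, which is the computational labor the authors mention.

```python
import math


def array_probability(cells):
    """Probability of a two-way array given its border totals:
    prod(row totals!) * prod(column totals!) / (n! * prod(cell counts!))."""
    rows = [sum(r) for r in cells]
    cols = [sum(c) for c in zip(*cells)]
    numerator = math.prod(math.factorial(t) for t in rows + cols)
    denominator = math.factorial(sum(rows)) * math.prod(
        math.factorial(x) for r in cells for x in r)
    return numerator / denominator


def compositions(total, caps):
    """Every vector v with sum(v) == total and 0 <= v[j] <= caps[j]."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in compositions(total - first, caps[1:]):
            yield (first,) + rest


def arrays_with_borders(rows, cols):
    """Step (a): every array having the given row and column totals."""
    if len(rows) == 1:
        yield [list(cols)]          # the last row is forced by the column totals
        return
    for first in compositions(rows[0], cols):
        remaining = [c - f for c, f in zip(cols, first)]
        for rest in arrays_with_borders(rows[1:], remaining):
            yield [list(first)] + rest


def freeman_halton(cells, rel_tol=1e-9):
    """Steps (b)-(d): sum the probabilities of all arrays that are as probable
    as, or less probable than, the observed one."""
    rows = [sum(r) for r in cells]
    cols = [sum(c) for c in zip(*cells)]
    p_obs = array_probability(cells)
    return sum(p for p in map(array_probability, arrays_with_borders(rows, cols))
               if p <= p_obs * (1 + rel_tol))


print(freeman_halton([[2, 0, 3], [1, 4, 0]]))   # hypothetical 2 x 3 array
```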

The general probability formula for any array may be stated verbally as follows. The probability of any array equals a ratio. Its numerator is the product of the factorials of all border totals. Its denominator is the product of two quantities: the factorial of the total sample size, raised to the power one less than the number of attributes, and the factorials of all cell frequencies.


The formula for computing the probability of having obtained any array through random sampling from the set of all arrays with fixed marginal totals is given symbolically for the two-way case as

$$P = \frac{\prod_i n_{i\cdot}!\,\prod_j n_{\cdot j}!}{n!\,\prod_{ij} n_{ij}!}$$

and for the three-way case as

$$P = \frac{\prod_i n_{i\cdot\cdot}!\,\prod_j n_{\cdot j\cdot}!\,\prod_k n_{\cdot\cdot k}!}{(n!)^2\,\prod_{ijk} n_{ijk}!}.$$
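Both formulas are instances of the verbal rule: border factorials over $(n!)^{m-1}$ times the cell factorials, where m is the number of attributes. The following Python sketch evaluates that rule for an array of any dimension stored as a dictionary keyed by index tuples, which is an assumed layout. It also includes a brute-force sanity check that the probabilities of all arrays sharing the observed borders sum to 1. The function names are hypothetical. The check requires every cell, zeros included, to be listed in the dictionary, and it is practical only for very small n.

```python
import math
from collections import Counter
from itertools import product


def one_way_borders(cells):
    """Border totals n_i.., n_.j., n_..k (and so on) for each attribute."""
    m = len(next(iter(cells)))
    totals = []
    for axis in range(m):
        border = Counter()
        for index, count in cells.items():
            border[index[axis]] += count
        totals.append(dict(border))
    return totals


def array_probability(cells):
    """prod(border totals!) / ((n!) ** (m - 1) * prod(cell counts!))."""
    m = len(next(iter(cells)))          # number of attributes
    n = sum(cells.values())
    numerator = math.prod(math.factorial(t) for border in one_way_borders(cells)
                          for t in border.values())
    denominator = math.factorial(n) ** (m - 1) * math.prod(
        math.factorial(c) for c in cells.values())
    return numerator / denominator


def total_probability_same_borders(observed):
    """Brute-force check: sum the formula over every array with the same cells,
    the same n, and the same one-way borders; the result should be 1."""
    keys = list(observed)
    n = sum(observed.values())
    target = one_way_borders(observed)
    total = 0.0
    for counts in product(range(n + 1), repeat=len(keys)):
        if sum(counts) != n:
            continue
        candidate = dict(zip(keys, counts))
        if one_way_borders(candidate) == target:
            total += array_probability(candidate)
    return total


observed = {key: 0 for key in product(range(2), repeat=3)}   # a 2 x 2 x 2 array
observed[(0, 0, 0)] = observed[(0, 1, 0)] = observed[(1, 1, 1)] = 1   # hypothetical
print(array_probability(observed))                # 8/36, about 0.2222
print(total_probability_same_borders(observed))   # 1.0 up to rounding
```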


Design of the Survey 





The Questions Under Investigation—The null hypotheses were deduced from the following questions under investigation:

1. Are attributes indicating the extent to which graduates utilized their training in actual teaching positions related to those attributes which were known about them at the time of their graduation?
2. Are the attributes which were known about graduates at the time of their graduation interrelated?
3. What were the reasons for graduates' not entering teaching or for their failing to teach during one or more of the first five years?


Description of the Variables—The variables selected for study were considered as two sets of attributes: (1) pregraduation attributes and (2) postgraduation attributes. The twelve pregraduation attributes were

1. Year graduated
2. Quarter graduated
3. Sex
4. Major field
5. Honors at commencement
6. Previous degree
7. Rank with major peer group
8. Age
9. Teaching previously
10. Duration of previous teaching
11. Military service
12. Extra-curricular activities

The four postgraduation attributes were

1. Taught since graduation
2. Taught during fifth year
3. Taught continuously
4. Intention toward future teaching


The attributes will be described later in some de- 
tail. First, however, the concept of ‘‘attribute’’ 
will be defined. 

The term "attribute" in the present paper refers to a set of mutually exclusive and exhaustive "categories," so that any person in a sample of individuals can be classified as belonging in one and only one category of a given attribute. The simplest form of attribute is the dichotomy, and most of the attributes considered here are of this type. Thus, a person either has or does not have the characteristic specified by the attribute. For example, persons were classified on the basis of "honors" by the categories "yes" and "no." Placement in category "yes" means that the person had honors conferred at commencement, while "no" designates that no honors were conferred. Classifying persons into categories of an attribute involves counting the discrete number of persons in each category. This operation is fundamentally different from measurement. In measurement, a scale of more or less equal units is applied to a person or object, and the measured number is an approximate number made up of a whole number and a decimal fraction, such as a length of 6.54 inches.

When a sample of individuals is classified on the basis of two attributes simultaneously, we count the number of persons who jointly possess each of the possible combinations of pairs of categories. Again, it should be noted that classification is mutually exclusive and exhaustive. Thus, any person or object is counted once and only once in any given contingency table.
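The following minimal Python sketch shows this cross-classification. It assumes each graduate's record is a dictionary of category labels, and the records below are hypothetical. `collections.Counter` tallies each pair of categories. The cell counts sum to the sample size because every person is counted once and only once.

```python
from collections import Counter

# Hypothetical records; each person falls in exactly one category per attribute.
records = [
    {"sex": "Male", "honors": "Yes"},
    {"sex": "Female", "honors": "No"},
    {"sex": "Female", "honors": "Yes"},
    {"sex": "Male", "honors": "No"},
    {"sex": "Female", "honors": "No"},
]


def cross_classify(records, attr_1, attr_2):
    """Count persons jointly in each pair of categories of two attributes."""
    return Counter((r[attr_1], r[attr_2]) for r in records)


table = cross_classify(records, "sex", "honors")
print(table[("Female", "No")])               # 2
print(sum(table.values()) == len(records))   # True
```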

The pregraduation and postgraduation attri- 
butes which were listed previously will now be 
described briefly. 

The first of the pregraduation attributes was "year graduated." A year here refers to a period beginning with a December commencement and extending through an August commencement. The particular years given are the years during which the August commencements fell. This attribute was a dichotomy, and the two categories were

1948
1949

The second pregraduation attribute was "quarter graduated." This refers to the session during which the degree was conferred. The two summer sessions were treated as quarters. The five categories were

Fall
Winter
Spring
Summer Session I (July commencement)
Summer Session II (August commencement)

The two categories of the third attribute, "sex," were

Male
Female

The fourth pregraduation attribute was "major field." The fifteen categories were

Art education 

Business education 

Elementary education 

English 

Foreign languages (French, German, Span- 
ish and Latin) 

Industrial education 

Mathematics 

Music education 

NKP (Nursery-kindergarten-primary) 

Physical education (men) 

Physical education (women) 

Science 

Social studies 

Speech 

Special fields (speech pathology and handi- 
capped children) 


The fifth pregraduation attribute was "honors at commencement." The two categories were

Honors conferred (Yes)
No honors conferred (No)

The sixth pregraduation attribute was "previous degree." This attribute refers to a graduate's having had a bachelor's degree before earning the B.S. in Education degree. The categories were

Had previous degree (Yes)
Had no previous degree (No)

The seventh pregraduation attribute was "rank with major peer group." The ranks were obtained from rank-order lists submitted by the major departments to the Bureau of Recommendations each spring. The three categories were

High
Medium
Low

The eighth pregraduation attribute was "age at time of graduation." Initially this was a continuous variable, and age was determined to the nearest whole year. The range of ages was divided into four class-intervals. In terms of the limits of the class-intervals, the four categories were

20-22 years
23-27 years
28-31 years
32-59 years

The four class-intervals for "age" were treated just as if they were categories of an attribute.

The ninth pregraduation attribute was "taught previously." The two categories were

Taught previously (Yes)
Had not taught previously (No)

The tenth pregraduation attribute was "duration of previous teaching." Like "age," this was initially a continuous variable, and duration was determined to the nearest whole year. The range of durations in years was divided into two class-intervals, so that the attribute became a dichotomy. The two categories were

1-4 years
5-30 years

The eleventh pregraduation attribute was "military service." This refers only to military service during World War II or between the end of World War II and commencement. The two categories were

Veteran (Yes)
Non-veteran (No)


The twelfth pregraduation attribute was "extra-curricular activities." Evidence of participation in extra-curricular activities was obtained from an open-ended question on the Registration Blank of the Bureau of Recommendations. To be classified as a participant, a graduate must have shown at least four acceptable activities from school and college combined. Acceptable activities did not include jobs for which pay might have been received, activities in military service, academic scholarship or organizations, athletic activities for physical education majors, or art activities for art majors. The two categories were

Evidence of marked participation (Yes)
No evidence (No)

The first of the postgraduation attributes was "taught since graduation." This attribute was a dichotomy in which possession of the attribute indicates that the graduate has done some teaching during the first five years since graduation. The two categories were

Taught since graduation (Yes)
Had not taught since graduation (No)

The second of the postgraduation attributes was "taught during fifth year." This attribute is a dichotomy in which possession of the attribute indicates that the graduate was teaching during the fifth year after graduation. The two categories were

Taught during fifth year (Yes)
Did not teach during fifth year (No)

The third postgraduation attribute was "taught continuously." This attribute was treated both as a trichotomy and, when either the latter two categories or the first and third categories were combined, as a dichotomy. The three categories were

Had taught continuously since graduation
Had teaching service interrupted for at least one year
Had not taught since graduation

The fourth postgraduation attribute was "intention toward teaching." This attribute was a polytomy with four categories, although in some cases categories were combined to facilitate the analysis. The categories were set up on the assumption that they formed a hierarchy of postgraduation experiences and attitudes, ordered by the extent to which graduates actually intended to apply their education training in teaching positions. The four categories were

Teaching during fifth year
Not teaching; plan to teach in future
Not teaching; uncertain about future teaching
Not teaching; do not plan to teach in future


Population and Samples—The population was defined as graduates of the College of Education of the University of Minnesota. The two samples in the present study consisted of two classes. Each class included graduates from all commencements over a one-year period extending from a fall commencement of one year through the August commencement of the following year.


Collection and Analysis of Data 





Sources of Data—Pregraduation data were collected from personnel records in the Bureau of Recommendations of the University of Minnesota; postgraduation data were collected by means of a mail questionnaire. Pregraduation data were available for every graduate. In collecting postgraduation data, an attempt was made by means of persistent follow-up procedures to secure at least a 90 percent response to the mail questionnaire.

The entire questionnaire was printed on the back of a postal card. The instructions asked the respondent to indicate every school year in which he had taught since graduation by marking the year from a list on the card. For every year in which he had not taught, he was asked to indicate the reason or reasons for not teaching from a list of reasons given in the accompanying letter. The list of reasons was as follows:

1. Salary
2. Working conditions
3. Health (personal or family)
4. Maternity
5. Military service
6. Further study
7. Marriage
8. Held non-teaching job in business or industry
9. Suitable teaching jobs unavailable
10. Other reasons (to be specified by respondent)


Response to the Questionnaire—Of the 940 persons in the total samples for the two years, two were found to be deceased. An attempt was made to reach the remaining 938 cases by means of the mail questionnaire.

Table I presents the number and percentage of persons responding to the questionnaire, along with pertinent information about the extent to which persons could be reached. Based upon the estimated number of persons reached, the response to the questionnaire was 91.3 percent for the total sample.

Analysis by Simple Contingency Tables—The first of the questions under investigation concerns the relationships between pregraduation and postgraduation attributes. Analyses directed at it were made largely by means of simple, or two-way, contingency tables, although complex tables were employed in some instances. A thorough analysis was made of associations between the more than one hundred possible pairs of attributes which may be set up from the twelve pregraduation attributes and the four postgraduation attributes. For the most part, associations were studied separately for the two years of graduation. These analyses involved merely the usual chi-square test and were largely repetitive. For that reason, the present paper gives only a summary of the results of the complete set of analyses directed toward answering the first question under investigation. Details of the tables have been reported elsewhere by the author (7). The results are summarized in Table II, where the probabilities of all significant chi-squares are tabulated.

It is appropriate here to specify the nature of the associations shown to be significant in Table II. This evidence bears upon the first of the basic questions, and it showed nearly all of the pregraduation attributes to be associated with one or more of the postgraduation attributes.

The attribute "taught since graduation" was found to be associated more often with the graduates who were female, who were not majoring in social science, who were younger, and who were not veterans.

TABLE I

PARTICIPATION IN FOLLOW-UP STUDY OF PERSONS GRADUATING FROM THE COLLEGE OF EDUCATION OF THE UNIVERSITY OF MINNESOTA WITH THE B.S. IN EDUCATION DEGREE BETWEEN DECEMBER, 1947, AND AUGUST, 1949, INCLUSIVE, AND REGISTERING WITH THE BUREAU OF RECOMMENDATIONS

Category | Number | Per Cent of Total (N = 940) | Per Cent Contacted (N = 916)
Total number of graduates during period* | 940 | 100.0 |
Number of graduates who could not be contacted by mail questionnaire | [illegible] | [illegible] |
Number of graduates known to be deceased | 2 | [illegible] |
Number of graduates thought to have been contacted by mail questionnaire | 916 | [illegible] | [illegible]
Number of graduates for whom questionnaires were completed | [illegible] | [illegible] | 91.3
Number of graduates for whom questionnaires were not completed | [illegible] | 8.5 | [illegible]

*The total sample included all majors from the Minneapolis campus except Nursing Education, Recreation Leadership, and Library Science.

TABLE II

SUMMARY OF SIGNIFICANT ASSOCIATIONS BETWEEN PREGRADUATION AND POSTGRADUATION ATTRIBUTES AS SHOWN BY CHI-SQUARE TESTS IN SIMPLE CONTINGENCY TABLES FOR 1948 AND 1949 GRADUATES

[Table body only partly recoverable from the scan. Columns: the postgraduation attributes Taught Since Graduation, Taught Fifth Year, Taught Continuously, and Intention Toward Teaching. Rows: the pregraduation attributes Year Graduated, Quarter Graduated (years combined), Sex, Major Field, Honors, Previous Degree, Rank, Age, Taught Previously, Duration of Teaching, Military Service, and Extra-Curricular Activities, given by year of graduation (1948, 1949). Legible entries, with column positions lost: Age, 1948: .005; Age, 1949: .001; Taught Previously, 1949: .001; Duration of Teaching, 1949: .02.]

The attribute "taught during fifth year" was found to be associated more often with the graduates who were summer graduates, who were male, who were older, who had taught previous to graduation, who were veterans, and who were participants in extra-curricular activities.

The attribute "taught continuously" was found to be associated more often with the graduates who were male, who were older, who had taught previous to graduation, who had taught for a longer duration previous to graduation, and who were veterans. Among all graduates, those who had interrupted teaching service were found most often to be honor graduates, non-veterans, and participants in extra-curricular activities. Of the social science majors who did enter teaching, fewer had interrupted teaching service.

The attribute "intention toward teaching" was found to be related to pregraduation attributes as follows:
- the females either planned to teach in the future (if not now teaching) or else were uncertain about future plans to teach;
- those who were ranked low in their major peer group planned not to teach in the future more often than those ranked medium or high;
- older graduates planned to teach in the future;
- those who had taught previously planned to teach in the future;
- those who had not taught previously were either uncertain or did not plan to teach in the future;
- non-veterans were uncertain about future plans to teach;
- participants in extra-curricular activities were either uncertain or planned to teach in the future.

Two simple tables will be considered in some detail, one to illustrate the exact method of Freeman and Halton, the other to illustrate scoring methods. The exact method is presented in this section of the paper, while scoring methods are deferred until after the illustrations on complex tables, because the data for the scoring illustrations depend upon the results of analyzing the complex tables.

To illustrate Fisher's exact treatment for 2 x 2 contingency tables and the exact treatment of Freeman and Halton for contingency tables other than the 2 x 2, the association between "extra-curricular activities" and "taught fifth year" was investigated in a small sample of twenty cases. This sample consisted of the Industrial Education majors among the 1948 graduates. These data are given in Table III. The chi-square obtained by the uncorrected shortcut formula was 4.3 with 1 df, which is significant at the .05 level. Chi-square by the corrected shortcut formula was 2.1 with 1 df, which is not significant at the .05 level. These results leave one in something of a dilemma. If we accept the results of the corrected formula, we must conclude that the results are non-significant. If we accept the results of the uncorrected formula, we conclude that the results are significant, but with the knowledge that we have violated the rule for minimum-sized theoretical frequencies. Exact methods are well suited to such a situation.
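The two shortcut values can be reproduced in a few lines of Python. This sketch assumes the cell counts 0, 3 / 11, 6 (row totals 3 and 17, column totals 11 and 9) as they can be recovered from the scanned Table III. The row order is an assumption, and it affects neither statistic. Yates' correction reduces |ad - bc| by n/2 before squaring. The smallest expected frequency shows why the minimum-size rule is violated, since a commonly cited rule of thumb asks for expected frequencies of at least 5. The function names are illustrative.

```python
def shortcut_chi_square(a, b, c, d, yates=False):
    """n(ad - bc)^2 / (product of row and column totals), optionally with
    Yates' correction, which reduces |ad - bc| by n/2 before squaring."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def smallest_expected(a, b, c, d):
    """Smallest theoretical frequency, row total * column total / n."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return min(r * col / n for r in rows for col in cols)


cells = (0, 3, 11, 6)      # as recovered from Table III; row order assumed
print(round(shortcut_chi_square(*cells), 1))              # 4.3
print(round(shortcut_chi_square(*cells, yates=True), 1))  # 2.1
print(smallest_expected(*cells))                          # 1.35
```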

First, we use Fisher’s exact method to determine whether the results are in fact significant. Fisher’s exact test assumes that the marginal totals remain fixed. Using a formula that gives the probability of any set of cell frequencies, one calculates the probabilities for the observed set of cell frequencies and for all more ‘‘extreme’’ arrays (i.e., arrays which would yield a chi-square of greater magnitude than that for the observed array), and sums these probabilities. This summed probability is the probability of having obtained by chance alone an array as ‘‘extreme’’ as, or ‘‘more extreme’’ than, the observed array, and it is then referred to some predetermined significance level, such as the .05 level. For example, if the obtained probability were more than .05, we would accept the null hypothesis, and we would reject it if the probability were less than .05. It should be noted that this is a one-tailed test of significance.
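The exact computation for Table III can be sketched as follows. This is a minimal illustration, not the author's procedure verbatim: `table_prob` uses the hypergeometric form of the probability, which equals the factorial formula when all margins are fixed, and `fisher_one_tailed` (a hypothetical helper name) treats as ‘‘more extreme’’ every table that departs further from independence in the same direction as the observed one.

```python
from math import comb


def table_prob(a, row1, col1, n):
    """Probability of a 2 x 2 table with top-left cell a, all margins fixed."""
    return comb(col1, a) * comb(n - col1, row1 - a) / comb(n, row1)


def fisher_one_tailed(a, b, c, d):
    """Return (P of the observed table, one-tailed P).

    The tail sums the observed table and every table lying further from
    the independence value of a, on the same side as the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    expected = row1 * col1 / n
    tail = range(lo, a + 1) if a <= expected else range(a, hi + 1)
    p_obs = table_prob(a, row1, col1, n)
    return p_obs, sum(table_prob(k, row1, col1, n) for k in tail)


p_obs, p_tail = fisher_one_tailed(0, 3, 11, 6)  # Table III
print(round(p_obs, 3), round(p_tail, 3))  # 0.074 0.074
```

Because the observed table already has a zero in the cell that defines the tail, no more extreme table exists in that direction, and the one-tailed probability equals the probability of the observed table.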

Using the exact formula given previously for the two-way table, we find that the probability of the observed array given by the data of Table III is

$$P = \frac{3!\,17!\,11!\,9!}{20!\,0!\,3!\,11!\,6!} = 0.074$$

Since no other array with the same marginal totals has a probability smaller than or equal to 0.074, this is in fact the probability corresponding to the significance level for the array. Since this probability is exact, it is evident that the uncorrected shortcut formula gave an underestimate of the probability and the corrected shortcut formula an overestimate. The term ‘‘exact’’ here refers to the fact that an exact test leads to exact probabilities for testing significance.

It was suspected that a significant association might be found if the degrees of freedom for the sample of twenty cases were increased. Therefore, the data were rearranged into a 2 x 3 contingency table by a regrouping of the basic categories: the former category of ‘‘did not teach during fifth year’’ was divided into two new categories. The data are given in Table IV.

The exact probability of obtaining the array in Table IV by random sampling alone is given by

$$P = \frac{3!\,17!\,11!\,7!\,2!}{20!\,0!\,3!\,0!\,11!\,4!\,2!} = 0.030702$$







TABLE III

2 x 2 CONTINGENCY TABLE TO TEST ASSOCIATION OF "EXTRA-CURRICULAR ACTIVITIES" AND THE ATTRIBUTE "TAUGHT DURING FIFTH YEAR," FOR 1948 INDUSTRIAL EDUCATION MAJORS

Extra-Curricular        Taught During Fifth Year
Activities              Yes            No             Total
Yes                      0 (0.0%)       3 (100.0%)     3 (100.0%)
No                      11 (64.7%)      6 (35.3%)     17 (100.0%)
Total                   11              9             20

χ² = 4.3 (without Yates' correction), df = 1, .02 < P < .05
χ² = 2.1 (with Yates' correction), df = 1, .10 < P < .20


TABLE IV

2 x 3 CONTINGENCY TABLE TO TEST ASSOCIATION OF "EXTRA-CURRICULAR ACTIVITIES" AND THE ATTRIBUTE "TAUGHT DURING FIFTH YEAR," FOR 1948 INDUSTRIAL EDUCATION MAJORS

                                        Extra-Curricular Activities
Taught During Fifth Year                Yes            No             Total
Taught fifth year                        0 (0.0%)     11 (64.7%)      11
Taught, left, and have not returned      3 (100.0%)    4 (23.5%)       7
Had not taught since graduation          0 (0.0%)      2 (11.8%)       2
Total                                    3 (100.0%)   17 (100.0%)     20







TABLE V

ALL POSSIBLE ARRAYS WITH FIXED MARGINAL TOTALS AS GIVEN BY THE DATA OF TABLE IV


TABLE VI

PROBABILITIES FOR ALL POSSIBLE ARRAYS WITH MARGINAL TOTALS FIXED BY OBSERVED 2 x 3 TABLE OF ATTRIBUTES, "EXTRA-CURRICULAR ACTIVITIES" AND "TAUGHT DURING FIFTH YEAR," 1948 INDUSTRIAL EDUCATION MAJORS





TABLE VII

2 x 2 x 2 CONTINGENCY TABLE OF ATTRIBUTES "SEX", "EXTRA-CURRICULAR ACTIVITIES", AND "TAUGHT FIFTH YEAR" FOR 1948 AND 1949 GRADUATES

























The method of Freeman and Halton asks us to consider the probabilities of all possible arrays for a given set of marginal totals. Before presenting the probabilities for these arrays, it is well to consider the nature of the arrays themselves. The nine 2 x 3 tables given in Table V (a through i) represent all possible arrays with the marginal totals of Table IV. Array a of Table V is the same as the data of Table IV. Probabilities of the other eight arrays were computed in like manner. The probabilities for the complete set of nine arrays are given in Table VI. There are two probabilities smaller than that of the observed array a, which is 0.030702; these two probabilities are those for arrays c and f and are numerically equal to 0.006140 and 0.009649, respectively. When these three probabilities are summed, the total is 0.046491. This figure represents the probability of having the observed array or any array more extreme. Since the obtained probability is less than .05, we reject the null hypothesis of no association between the attributes.
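The Freeman-Halton enumeration for Table IV can be sketched in a few lines of Python. The names are illustrative; each array is identified by its first row (the three participants), and the summed probability includes every array whose probability does not exceed that of the observed one, which is the criterion used above. The sketch returns the nine probabilities but does not reproduce the a-through-i lettering of Tables V and VI.

```python
from itertools import product
from math import comb


def freeman_halton_2xc(observed_row, col_totals):
    """Exact test for a 2 x c table with fixed margins.

    Arrays are indexed by their first row; the second row follows from the
    column totals. Returns (probabilities by first row, P of observed,
    summed P of arrays no more probable than the observed one).
    """
    row1, n = sum(observed_row), sum(col_totals)
    probs = {}
    for first in product(*(range(t + 1) for t in col_totals)):
        if sum(first) == row1:
            num = 1
            for x, t in zip(first, col_totals):
                num *= comb(t, x)
            probs[first] = num / comb(n, row1)
    p_obs = probs[tuple(observed_row)]
    # small relative tolerance so that ties with the observed array are kept
    p_value = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return probs, p_obs, p_value


# Table IV: first row = participants in extra-curricular activities;
# columns = taught fifth year / taught, left, not returned / had not taught.
probs, p_obs, p_value = freeman_halton_2xc((0, 3, 0), (11, 7, 2))
print(len(probs), round(p_obs, 6), round(p_value, 6))  # 9 0.030702 0.046491
```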

The results of the foregoing test of significance indicate that there is evidence of association between ‘‘extra-curricular activities’’ and ‘‘taught during fifth year’’ for the 1948 Industrial Education majors. If, however, one had used only the 2 x 2 table, the evidence would not have been found; a 2 x 3 table was needed in order to uncover the relation.

These results raise the question of the nature of the association. The association may be interpreted by noting the percentages in the various cells of the 2 x 3 table. Among the three persons who had participated in extra-curricular activities, all (100.0 percent) were found to have taught sometime during the first five years, to have left teaching, and not to have returned by the fifth year. Among the seventeen persons who had not participated in extra-curricular activities, a relatively high percentage (64.7) were found to have taught during the fifth year. We may state the relationship, then, by saying that participants in extra-curricular activities tended not to teach during the fifth year, while non-participants tended, by and large, to teach during the fifth year. These results are similar to those found for all majors for the 1948 and 1949 samples. In addition, for the Industrial Education majors it was found that participants tended to teach for a short period, to leave teaching, and not to return by the fifth year.

Analysis by Complex Contingency Tables— 
Two complex contingency tables are illustrated here. The first is a 2 x 2 x 2 table illustrating the method of Bartlett and Snedecor, while the other is a four-way table illustrating the likelihood ratio method.

In order to illustrate the method of Bartlett and Snedecor for the 2 x 2 x 2 table, the data were classified by the two pregraduation attributes ‘‘sex’’ and ‘‘extra-curricular activities,’’ and by the postgraduation attribute ‘‘taught during fifth year.’’ The results of testing the association of the postgraduation attribute ‘‘taught fifth year’’ with each of the pregraduation attributes ‘‘sex’’ and ‘‘extra-curricular activities’’ individually in two-way tables had shown strong association (i.e., probabilities smaller than .001) in every instance. It was suspected that there might be association among all three of the attributes, or what has been called ‘‘second-order interaction’’ by Bartlett and Snedecor. The kind of second-order interaction suspected in the present instance was a sex difference in the association between the two attributes ‘‘extra-curricular activities’’ and ‘‘taught fifth year.’’

The 2 x 2 x 2 table is shown in Table VII. The first step in the analysis of the data was to examine each of the 2 x 2 tables classified by the two attributes ‘‘extra-curricular activities’’ and ‘‘taught fifth year’’ for independence. Thus, there were separate 2 x 2 tables for the two sexes, one for males and the other for females. The obtained chi-square for males was 0.447, with 1 df, which is not significant at the .05 level; chi-square for females was 18.665, with 1 df, which is significant at the .001 level.

The second-order interaction was calculated as follows:

$$\frac{a\,d\,b'\,c'}{b\,c\,a'\,d'} = \frac{(79)(121)(81)(168)}{(176)(62)(73)(75)} = 2.18$$


In order to test the hypothesis that in the population the second-order interaction is unity, the value of the deviation |x| common to all eight cells of the 2 x 2 x 2 table was determined by solving the equation

$$(a-x)(d-x)(c'-x)(b'-x) = (c+x)(b+x)(a'+x)(d'+x)$$

Substituting the cell frequencies from our data, we obtain the equation

$$(79-x)(121-x)(168-x)(82-x) = (62+x)(176+x)(73+x)(75+x)$$

which reduces to

$$514x^3 + 24{,}664x^2 + 4{,}318{,}692x - 71{,}941{,}584 = 0$$


A first approximation may be obtained by ignoring the first two terms and solving the equation

$$4{,}318{,}692x - 71{,}941{,}584 = 0$$

Solving this equation, we obtain as a first approximation

$$x = \frac{71{,}941{,}584}{4{,}318{,}692} = 16.658$$


As a first step in finding two values of x between which the left-hand member of the equation changes sign, the obtained x above was rounded off to 16 and substituted in the cubic equation, yielding the value 5,576,816 for the left-hand member. The second step was to substitute the next lower integer, 15, for x, yielding the value 122,946, which is smaller but not negative. When 14 was substituted for x, the left-hand member yielded the negative value -5,235,336, indicating the desired change in sign. Therefore, the value of x was known to lie between 14 and 15.

An improved approximation was obtained by solving the equation

$$\frac{x - 14}{15 - x} = \frac{5{,}235{,}336}{122{,}946}$$

from which x ≈ 14.98.

A better approximation was desired, and it was found that the left-hand member was 14,834.76 for x = 14.98 and -39,205.44 for x = 14.97. The equation

$$\frac{x - 14.97}{14.98 - x} = \frac{39{,}205.44}{14{,}834.76}$$

was then solved, from which

x = 14.977
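Bracketing a change in sign and then interpolating linearly is the method of false position (regula falsi). A small sketch, applied to the cubic with the coefficients 514, 24,664, and 4,318,692 taken as positive and the constant -71,941,584 (these are the signs that reproduce the printed values of the left-hand member at 16, 15, and 14); the helper names `cubic` and `false_position` are ours.

```python
def cubic(x):
    """Left-hand member of the cubic for the common deviation x."""
    return 514 * x**3 + 24_664 * x**2 + 4_318_692 * x - 71_941_584


def false_position(f, lo, hi, tol=1e-10, max_iter=200):
    """Root of f between lo and hi by repeated linear interpolation.

    Each step solves (x - lo) / (hi - x) = -f(lo) / f(hi), the proportion
    used in the text, then keeps the sub-interval that still brackets
    the change in sign.
    """
    flo, fhi = f(lo), f(hi)
    if flo * fhi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    x_prev = lo
    for _ in range(max_iter):
        x = lo - flo * (hi - lo) / (fhi - flo)
        fx = f(x)
        if fx == 0 or abs(x - x_prev) < tol:
            return x
        if flo * fx < 0:
            hi, fhi = x, fx
        else:
            lo, flo = x, fx
        x_prev = x
    return x


print(cubic(16), cubic(15), cubic(14))          # 5576816 122946 -5235336
print(round(71_941_584 / 4_318_692, 3))         # 16.658
print(round(false_position(cubic, 14, 15), 3))  # 14.977
```

The first interpolation between 14 and 15 already lands within about 0.0002 of the root, which is why the hand refinement between 14.97 and 14.98 changes only the third decimal.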
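The obtained value of x, rounded off to two decimal places, or x = 14.98, was substituted in the formula for chi-square as follows:

$$
\begin{aligned}
\chi^2 &= (x - 0.5)^2\left(\frac{1}{a-x} + \frac{1}{d-x} + \frac{1}{c'-x} + \frac{1}{b'-x} + \frac{1}{c+x} + \frac{1}{b+x} + \frac{1}{a'+x} + \frac{1}{d'+x}\right)\\
&= (14.98-0.5)^2\left(\frac{1}{79-14.98} + \frac{1}{121-14.98} + \frac{1}{168-14.98} + \frac{1}{82-14.98} + \frac{1}{62+14.98} + \frac{1}{176+14.98} + \frac{1}{73+14.98} + \frac{1}{75+14.98}\right)\\
&= (14.48)^2\left(\frac{1}{64.02} + \frac{1}{106.02} + \frac{1}{153.02} + \frac{1}{67.02} + \frac{1}{76.98} + \frac{1}{190.98} + \frac{1}{87.98} + \frac{1}{89.98}\right)\\
&= 224.400(0.0872)\\
&= 19.568
\end{aligned}
$$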


The obtained chi-square for the second-order 
interaction is 19.568, with 1 df, which is signif- 
icant at the .001 level. Therefore, we reject the 
null hypothesis of no interaction. 

Rejecting the hypothesis of no interaction raises the question of the nature of the interaction. A significant interaction tells us that the nature of the association of the other two attributes differs between the sexes. The differences are such that among the males, participants and non-participants in extra-curricular activities were equally likely to have taught during the fifth year, while among the females the participants in extra-curricular activities tended not to teach during the fifth year.

A four-way contingency table classified by ‘‘year’’, ‘‘rank’’, ‘‘honors’’ and ‘‘extra-curricular activities’’ is given in Table VIII. By means of the likelihood ratio, five hypotheses were tested for this table. They were:

1. The four attributes, ‘‘year’’, ‘‘rank’’, ‘‘honors’’, and ‘‘extra-curricular activities’’, are mutually independent.

2. The three attributes, ‘‘rank’’, ‘‘honors’’, and ‘‘extra-curricular activities’’, are mutually independent.

3. The two attributes, ‘‘rank’’ and ‘‘honors’’, are independent.

4. The two attributes, ‘‘rank’’ and ‘‘extra-curricular activities’’, are independent.

5. The two attributes, ‘‘honors’’ and ‘‘extra-curricular activities’’, are independent.


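In Table VIII the letters A, B, C, and D in our symbol system represent the attributes in the following order: ‘‘year’’, ‘‘rank’’, ‘‘honors’’, and ‘‘extra-curricular activities’’. The corresponding subscripts are i, j, k, and h, and the numbers of categories are r = 2, s = 3, t = 2, and u = 2, respectively.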

It will be recalled that any of the likelihood ratio criterion formulas for estimating chi-square for all possible hypotheses based upon a four-way table can be expressed in terms of combinations of eighteen terms, which are natural logarithms of functions of cell or marginal totals. To test the five hypotheses for the present four-way table, only twelve of the eighteen terms were required. The twelve terms, along with their calculated numerical values, are listed below.

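TABLE VIII

COMPLEX CONTINGENCY TABLE CLASSIFIED BY THE ATTRIBUTES "YEAR", "RANK", "HONORS" AND "EXTRA-CURRICULAR ACTIVITIES" FOR GRADUATES OF 1948 AND 1949

[Rows: year (1948, 1949), each subdivided by the three categories of rank. Columns: honors (yes, no), each subdivided by extra-curricular activities (yes, no).]

$$
\begin{aligned}
(1)\quad \sum_i n_{i\cdot\cdot\cdot}\log_e n_{i\cdot\cdot\cdot} &= 2.30258\sum_i n_{i\cdot\cdot\cdot}\log_{10} n_{i\cdot\cdot\cdot} = 3{,}332.67612\\
(2)\quad \sum_j n_{\cdot j\cdot\cdot}\log_e n_{\cdot j\cdot\cdot} &= 2.30258\sum_j n_{\cdot j\cdot\cdot}\log_{10} n_{\cdot j\cdot\cdot} = 3{,}090.81427\\
(3)\quad \sum_k n_{\cdot\cdot k\cdot}\log_e n_{\cdot\cdot k\cdot} &= 2.30258\sum_k n_{\cdot\cdot k\cdot}\log_{10} n_{\cdot\cdot k\cdot} = 3{,}434.65550\\
(4)\quad \sum_h n_{\cdot\cdot\cdot h}\log_e n_{\cdot\cdot\cdot h} &= 2.30258\sum_h n_{\cdot\cdot\cdot h}\log_{10} n_{\cdot\cdot\cdot h} = 3{,}321.90198\\
(5)\quad \sum_{i,j,k,h} n_{ijkh}\log_e n_{ijkh} &= 2.30258\sum_{i,j,k,h} n_{ijkh}\log_{10} n_{ijkh} = 2{,}030.79971\\
(9)\quad \sum_{j,k,h} n_{\cdot jkh}\log_e n_{\cdot jkh} &= 2.30258\sum_{j,k,h} n_{\cdot jkh}\log_{10} n_{\cdot jkh} = 2{,}429.38732\\
(13)\quad \sum_{j,k} n_{\cdot jk\cdot}\log_e n_{\cdot jk\cdot} &= 2.30258\sum_{j,k} n_{\cdot jk\cdot}\log_{10} n_{\cdot jk\cdot} = 2{,}820.43333\\
(14)\quad \sum_{j,h} n_{\cdot j\cdot h}\log_e n_{\cdot j\cdot h} &= 2.30258\sum_{j,h} n_{\cdot j\cdot h}\log_{10} n_{\cdot j\cdot h} = 2{,}687.88973\\
(15)\quad \sum_{k,h} n_{\cdot\cdot kh}\log_e n_{\cdot\cdot kh} &= 2.30258\sum_{k,h} n_{\cdot\cdot kh}\log_{10} n_{\cdot\cdot kh} = 3{,}032.68764\\
(16)\quad n\log_e n &= (2.30258)(n\log_{10} n) = 3{,}727.39025\\
(17)\quad 2n\log_e n &= (2.30258)(2n\log_{10} n) = 7{,}454.78051\\
(18)\quad 3n\log_e n &= (2.30258)(3n\log_{10} n) = 11{,}182.17076
\end{aligned}
$$

The required frequency totals for the twelve terms were obtained by summing frequencies according to the indicated process. For example, the totals denoted $n_{i\cdot\cdot\cdot}$, which represent the total frequencies for the years 1948 and 1949, were obtained by adding all cell frequencies for 1948 separately, then all cell frequencies for 1949. Symbolically, the summation process is

$$n_{i\cdot\cdot\cdot} = \sum_h\sum_k\sum_j n_{ijkh}$$

and since 1948 has been taken arbitrarily as the first of the categories $i$ for ‘‘year’’, so that $i = 1$ for 1948, the total for 1948 would be, symbolically,

$$n_{1\cdot\cdot\cdot} = \sum_h\sum_k\sum_j n_{1jkh}$$

This expression tells us to add all the cells for 1948 in a particular order, viz., to sum the categories for ‘‘rank’’ first, then ‘‘honors’’, then ‘‘extra-curricular activities’’. The numerical process was as follows. First, cell frequencies were added across the ‘‘rank’’ categories, so that in the upper left-hand 3 x 2 table we add, for the first column,

$$n_{1\cdot 11} = \sum_j n_{1j11} = n_{1111} + n_{1211} + n_{1311} = 13 + 10 + 3 = 26$$

The three remaining totals for 1948 were added in the same way; they are numerically equal to 12, 98, and 98, respectively. The totals for ‘‘extra-curricular activities’’ were then summed across the categories of ‘‘honors’’ as follows:

$$n_{1\cdot\cdot h} = \sum_{k=1}^{2} n_{1\cdot kh}$$

The two totals for ‘‘extra-curricular activities’’ were

$$n_{1\cdot\cdot 1} = 26 + 98 = 124, \qquad n_{1\cdot\cdot 2} = 12 + 98 = 110$$

When these totals are summed across the two categories of ‘‘extra-curricular activities’’, we obtain the total frequency for 1948:

$$n_{1\cdot\cdot\cdot} = \sum_{h=1}^{2} n_{1\cdot\cdot h} = 124 + 110 = 234$$

The totals needed for the remaining eleven terms were obtained in similar fashion.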

The twelve terms were calculated by using the summation form of the formula, substituting logarithms to the base 10 initially and then changing to the base e by multiplying by the constant $\log_e 10 = 2.30258$. For example, term (1) was calculated numerically as follows:

$$\sum_i n_{i\cdot\cdot\cdot}\log_e n_{i\cdot\cdot\cdot} = 2.30258\sum_i n_{i\cdot\cdot\cdot}\log_{10} n_{i\cdot\cdot\cdot} = 2.30258\,[(234)(2.36922) + (351)(2.54407)] = (2.30258)(1447.36605) = 3{,}332.67612$$


In order to test the first hypothesis, that the four attributes are mutually independent, we use the formula of the first kind of hypothesis for a four-way table. The expression for the likelihood criterion in terms of sums and differences of certain of the eighteen terms is as follows:

$$\log_e\lambda = (1) + (2) + (3) + (4) - (18) - (5)$$

Substituting the numerical values of the terms, we have

$$\log_e\lambda = 3{,}332.67612 + 3{,}090.81427 + 3{,}434.65550 + 3{,}321.90198 - 11{,}182.17076 - 2{,}030.79971 = -32.92260$$

Using the formula for estimating chi-square in terms of λ, we have

$$\chi^2 = -2\log_e\lambda = (-2)(-32.92260) = 65.84520$$

where the degrees of freedom are computed as follows:

$$df = rstu - (r+s+t+u) + 3 = (2)(3)(2)(2) - (2+3+2+2) + 3 = 18$$

The obtained chi-square for the first hypothesis, 65.845 with 18 df, is significant at the .001 level. Therefore, we reject the null hypothesis.
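For a check that does not depend on the printed logarithms, the criterion for the first type of hypothesis (mutual independence of all the classifications) can be computed directly from a frequency array. This is a NumPy sketch under the assumption that the table is stored with one axis per attribute (for Table VIII: year, rank, honors, extra-curricular activities); the function names are ours, and the usage example uses a small synthetic table, constructed to be exactly independent, rather than the data of Table VIII.

```python
import numpy as np


def sum_n_log_n(counts):
    """Sum of n * ln(n) over an array of frequencies, taking 0 * ln 0 = 0."""
    c = np.asarray(counts, dtype=float).ravel()
    c = c[c > 0]
    return float(np.sum(c * np.log(c)))


def mutual_independence_g2(table):
    """Likelihood-ratio chi-square (-2 ln lambda) for mutual independence of
    all classifications of a d-way table, with its degrees of freedom.

    For d = 4, ln lambda = (1) + (2) + (3) + (4) - (18) - (5) in the text's
    numbering: one-way margins, minus 3 n ln n, minus the cell term.
    """
    t = np.asarray(table, dtype=float)
    n, d = t.sum(), t.ndim
    log_lambda = -(d - 1) * n * np.log(n) - sum_n_log_n(t)
    for axis in range(d):
        others = tuple(ax for ax in range(d) if ax != axis)
        log_lambda += sum_n_log_n(t.sum(axis=others))  # one-way margin
    df = int(np.prod(t.shape)) - int(np.sum(t.shape)) + (d - 1)
    return -2.0 * float(log_lambda), df


# Synthetic 2 x 3 x 2 x 2 table built as a product of margins,
# so the four classifications are exactly independent.
indep = np.einsum("i,j,k,h->ijkh", [1, 2], [1, 1, 2], [2, 1], [1, 3])
g2, df = mutual_independence_g2(indep)
print(df)  # 18, i.e. rstu - (r + s + t + u) + 3
```

For an exactly independent table the criterion is zero apart from rounding error; any departure from independence makes it positive.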

In order to test the second hypothesis, that the three attributes ‘‘rank’’, ‘‘honors’’ and ‘‘extra-curricular activities’’ are mutually independent, we use a formula of the second type. The likelihood ratio criterion is

$$\log_e\lambda = (2) + (3) + (4) - (17) - (9)$$

Substituting the numerical values of the five terms, we have

$$\log_e\lambda = 3{,}090.81427 + 3{,}434.65550 + 3{,}321.90198 - 7{,}454.78051 - 2{,}429.38732 = -36.79608$$

from which

$$\chi^2 = 73.59216$$

The df is given by

$$df = stu - (s + t + u) + 2 = (3)(2)(2) - (3 + 2 + 2) + 2 = 7$$

The obtained chi-square for the second hypothesis, 73.592 with 7 df, is significant at the .001 level. Therefore, we reject the null hypothesis.

In order to test the third, fourth and fifth hypotheses, formulas of the third type were used. The appropriate formula for an estimate of chi-square to test the third hypothesis, that ‘‘rank’’ and ‘‘honors’’ are independent, is expressed in terms of sums and differences of the eighteen terms as follows:

$$\log_e\lambda = (2) + (3) - (16) - (13) = 3{,}090.81427 + 3{,}434.65550 - 3{,}727.39025 - 2{,}820.43333 = -22.35381$$

from which

$$\chi^2 = 44.70762$$

The df is given by

$$df = (s - 1)(t - 1) = 2$$

The obtained chi-square for the third hypothesis, 44.708 with 2 df, is significant at the .001 level. Therefore, we reject the null hypothesis.

The appropriate formula for an estimate of chi-square to test the fourth hypothesis, that ‘‘rank’’ and ‘‘extra-curricular activities’’ are independent, is expressed in terms of certain of the eighteen terms as follows:

$$\log_e\lambda = (2) + (4) - (16) - (14)$$

Substituting the numerical values of the four terms, we obtain

$$\log_e\lambda = 3{,}090.81427 + 3{,}321.90198 - 3{,}727.39025 - 2{,}687.88973 = -2.56373$$

from which

$$\chi^2 = 5.12746$$

The df is given by

$$df = (s - 1)(u - 1) = 2$$

The obtained chi-square for the fourth hypothesis, 5.127 with 2 df, is not significant at the .05 level. Therefore, we do not reject the null hypothesis.

The appropriate formula for an estimate of chi-square to test the fifth hypothesis, that ‘‘honors’’ and ‘‘extra-curricular activities’’ are independent, is expressed in terms of sums and differences of certain of the eighteen terms as follows:

$$\log_e\lambda = (3) + (4) - (16) - (15)$$

Substituting the numerical values of the four terms, we obtain

$$\log_e\lambda = 3{,}434.65550 + 3{,}321.90198 - 3{,}727.39025 - 3{,}032.68764 = -3.52041$$

from which

$$\chi^2 = 7.04082$$

The df is given by

$$df = (t - 1)(u - 1) = 1$$

The obtained chi-square for the fifth hypothesis, 7.041 with 1 df, is significant at the .01 level. Therefore, we reject the null hypothesis.
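The five criteria differ only in which of the twelve printed terms are added and subtracted. The sketch below takes the printed term values and the combinations given in the text and recomputes each chi-square; the dictionary names `TERMS` and `HYPOTHESES` are illustrative.

```python
# The twelve terms as printed (sums of n * ln n), keyed by the text's numbers.
TERMS = {
    1: 3332.67612, 2: 3090.81427, 3: 3434.65550, 4: 3321.90198,
    5: 2030.79971, 9: 2429.38732, 13: 2820.43333, 14: 2687.88973,
    15: 3032.68764, 16: 3727.39025, 17: 7454.78051, 18: 11182.17076,
}

# hypothesis number: (terms added, terms subtracted, degrees of freedom)
HYPOTHESES = {
    1: ((1, 2, 3, 4), (18, 5), 18),
    2: ((2, 3, 4), (17, 9), 7),
    3: ((2, 3), (16, 13), 2),
    4: ((2, 4), (16, 14), 2),
    5: ((3, 4), (16, 15), 1),
}


def lr_chi_square(plus, minus):
    """chi-square = -2 ln(lambda), with ln(lambda) = sum(plus) - sum(minus)."""
    log_lambda = sum(TERMS[k] for k in plus) - sum(TERMS[k] for k in minus)
    return -2.0 * log_lambda


for h, (plus, minus, df) in HYPOTHESES.items():
    print(h, round(lr_chi_square(plus, minus), 3), df)
# 1 65.845 18
# 2 73.592 7
# 3 44.708 2
# 4 5.127 2
# 5 7.041 1
```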

Analysis by Scoring Techniques—The methods of Yates and of Williams were illustrated with a 3 x 3 contingency table. The data are given in Table IX. Here the classification is based upon the three attributes ‘‘rank’’, ‘‘honors’’, and ‘‘extra-curricular activities’’. The likelihood ratio analysis had shown mutual association among these three attributes. The rows of Table IX are classified by categories of ‘‘rank’’, and the columns by a combination of the categories of ‘‘honors’’ and ‘‘extra-curricular activities’’.

The obtained chi-square for Table IX was 34.436, with 4 df, which is significant at the .001 level. Therefore, we reject the null hypothesis of independence of the classifications. Since this is a contingency table in which association is known to exist, the methods of Yates and Williams are applicable. This is especially true since there appears to be a kind of departure in the table which resembles the arrangement of frequencies in a scatter diagram in which positive correlation exists. For example, in the first column of Table IX there is a tendency for the cases to ‘‘pile up’’ in the lower left-hand corner (34.3 percent) as compared with the upper left-hand corner (20.6 percent), while in the third column there is a tendency for the cases to ‘‘pile up’’ in the upper right-hand corner (55.0 percent) as compared with the lower right-hand corner (12.7 percent). In the methods of Yates and of Williams, we test whether the association between row and column scores is significant, where the scores are assigned on the assumption of an underlying quantitative basis.

The Yates solution follows the representation in Figure 1. The numerical scoring scheme for rows and columns and the summation terms which were calculated with the aid of these scores are given in Table X. The scores for rows and columns were chosen arbitrarily. In the present data it was convenient to make the score for the middle category of the rows and of the columns equal to zero, in order to simplify the calculations.

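TABLE IX

CONTINGENCY TABLE TO TEST ASSOCIATION BETWEEN "RANK" AND COMBINATIONS OF "HONORS" AND "EXTRA-CURRICULAR ACTIVITIES"

Combinations of Honors and Extra-Curricular Activities:
(1) Both "no honors" and "no extra-curricular"
(2) Either "honors" and "no extra-curricular" or "no honors" and "extra-curricular"
(3) Both "honors" and "extra-curricular"

Rank          (1)             (2)             (3)            Total
Row 1          51 (20.6%)      80 (30.0%)      39 (55.0%)     170
Row 2         112 (45.2%)     100 (37.6%)      23 (32.4%)     235
Row 3          85 (34.3%)      86 (32.3%)       9 (12.7%)     180
Total         248 (100.1%)    266 (99.9%)      71 (100.1%)    585

χ² = 34.436, df = 4

TABLE X

SCORING SCHEME AND SUMMATION TERMS IN YATES' SOLUTION OF DATA IN TABLE IX

Column scores y_j:       -1       0       1
Column totals n.j:      248     266      71

Row    Score x_i    n_i.    Σ_j y_j n_ij
1          1         170       -12
2          0         235       -89
3         -1         180       -76

Σ x_i n_i. = -10    Σ x_i² n_i. = 350    Σ y_j n.j = -177    Σ y_j² n.j = 319    Σ x_i y_j n_ij = 64

The three quantities, A, A′ and B, needed for the regression formulas were calculated from the summation terms as follows:

$$A = n_{\cdot\cdot}\sum_i x_i^2 n_{i\cdot} - \Big(\sum_i x_i n_{i\cdot}\Big)^2 = 585(350) - (-10)^2 = 204{,}650$$

$$A' = n_{\cdot\cdot}\sum_j y_j^2 n_{\cdot j} - \Big(\sum_j y_j n_{\cdot j}\Big)^2 = 585(319) - (-177)^2 = 155{,}286$$

$$B = n_{\cdot\cdot}\sum_i\sum_j x_i y_j n_{ij} - \Big(\sum_i x_i n_{i\cdot}\Big)\Big(\sum_j y_j n_{\cdot j}\Big) = 585(64) - (-10)(-177) = 35{,}670$$

The two regression coefficients were calculated as follows:

$$b_{xy} = \frac{B}{A} = \frac{35{,}670}{204{,}650} = 0.17430, \qquad b_{yx} = \frac{B}{A'} = \frac{35{,}670}{155{,}286} = 0.22971$$

The standard errors of the regression coefficients were calculated as follows:

$$\text{S.E. of } b_{xy} = \sqrt{\frac{A'}{A\,n_{\cdot\cdot}}} = \sqrt{\frac{155{,}286}{(204{,}650)(585)}} = 0.03601$$

$$\text{S.E. of } b_{yx} = \sqrt{\frac{A}{A'\,n_{\cdot\cdot}}} = \sqrt{\frac{204{,}650}{(155{,}286)(585)}} = 0.04746$$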


The significance of each of the regression coefficients was tested by means of the t test, where the hypothesis in each case was that the population regression coefficient was equal to zero. The t tests for the b's were performed as follows:

$$t = \frac{b_{xy}}{\text{S.E. of } b_{xy}} = \frac{0.17430}{0.03601} = 4.840, \qquad t = \frac{b_{yx}}{\text{S.E. of } b_{yx}} = \frac{0.22971}{0.04746} = 4.840$$

Referring the obtained t's to the normal tables, we find that P < .001. Therefore, we reject the null hypothesis and conclude that the population regression coefficients are not equal to zero.

As an alternative test, we may square t, and we have

$$\chi^2 = t^2 = (4.840)^2 = 23.426, \qquad df = 1, \qquad P < .001$$

Therefore, the association of row and column scores is shown to be significant.
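The Yates computation can be reproduced with plain Python. The row scores 1, 0, -1 and column scores -1, 0, 1 are an assumption on our part, chosen because they reproduce every summation term used above (Σxn = -10, Σx²n = 350, Σyn = -177, Σy²n = 319, Σxyn = 64); the names `yates_regression`, `X_SCORES`, and `Y_SCORES` are illustrative. The labels b_xy = B/A and b_yx = B/A′ follow the text.

```python
import math

# Table IX: rows = rank categories, columns = combinations of honors
# and extra-curricular activities.
TABLE_IX = [[51, 80, 39],
            [112, 100, 23],
            [85, 86, 9]]
X_SCORES = [1, 0, -1]   # assumed row scores
Y_SCORES = [-1, 0, 1]   # assumed column scores


def yates_regression(table, xs, ys):
    """Return A, A', B, the two regression coefficients and their S.E.s."""
    n_i = [sum(row) for row in table]
    n_j = [sum(col) for col in zip(*table)]
    n = sum(n_i)
    sx = sum(x * m for x, m in zip(xs, n_i))
    sxx = sum(x * x * m for x, m in zip(xs, n_i))
    sy = sum(y * m for y, m in zip(ys, n_j))
    syy = sum(y * y * m for y, m in zip(ys, n_j))
    sxy = sum(xs[i] * ys[j] * table[i][j]
              for i in range(len(xs)) for j in range(len(ys)))
    A = n * sxx - sx ** 2
    A_prime = n * syy - sy ** 2
    B = n * sxy - sx * sy
    b_xy, b_yx = B / A, B / A_prime          # labels as in the text
    se_xy = math.sqrt(A_prime / (A * n))
    se_yx = math.sqrt(A / (A_prime * n))
    return A, A_prime, B, b_xy, b_yx, se_xy, se_yx


A, A_prime, B, b_xy, b_yx, se_xy, se_yx = yates_regression(TABLE_IX, X_SCORES, Y_SCORES)
t = b_xy / se_xy
print(A, A_prime, B)                  # 204650 155286 35670
print(round(t, 2), round(t * t, 1))   # 4.84 23.4
```

Both t ratios reduce algebraically to B√n / √(AA′), which is why the two tests give the same value.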
The row and column scores used in the Yates method were arbitrary, and we do not know that they were the most appropriate for the data. Williams’ method provides empirically derived scores which yield the maximum value of the correlation coefficient that would be possible for any set of scores for the same data. The Williams solution follows the scheme given previously in this paper.

Starting with the original data in Table IX, the matrix G was computed by the formula for the elements

$$g_{ij} = \frac{n_{ij}}{\sqrt{n_{i\cdot}\,n_{\cdot j}}}$$

For example, the element of matrix G in the first row and first column was computed as follows:

$$g_{11} = \frac{n_{11}}{\sqrt{n_{1\cdot}\,n_{\cdot 1}}} = \frac{51}{\sqrt{(170)(248)}} = .24838$$

The matrix G is

$$G = \begin{bmatrix} .24838 & .37621 & .35500\\ .46394 & .39997 & .17806\\ .40231 & .39304 & .07961 \end{bmatrix}$$

The matrix T was obtained as T = GG′; numerically,

$$T = GG' = \begin{bmatrix} .24838 & .37621 & .35500\\ .46394 & .39997 & .17806\\ .40231 & .39304 & .07961 \end{bmatrix}\begin{bmatrix} .24838 & .46394 & .40231\\ .37621 & .39997 & .39304\\ .35500 & .17806 & .07961 \end{bmatrix} = \begin{bmatrix} .3292516 & .3289174 & .2760529\\ .3289174 & .4069217 & .3580273\\ .2760529 & .3580273 & .3226715 \end{bmatrix}$$

In order to calculate the scores for rows and columns which would maximize the correlation between the two sets of scores, it was necessary, first, to solve for the second largest latent root λ of the matrix T. From T we have the determinantal equation

$$\begin{vmatrix} .3292516-\lambda & .3289174 & .2760529\\ .3289174 & .4069217-\lambda & .3580273\\ .2760529 & .3580273 & .3226715-\lambda \end{vmatrix} = 0$$

which when expanded yields the characteristic equation

$$\lambda^3 - 1.0588\lambda^2 + 0.0589\lambda = 0$$

From Williams (12) we know that the largest root of the characteristic equation is unity, that the smallest root is zero, and that the remaining root, which we desire here, lies somewhere between zero and unity. Solving the cubic equation by the method previously given, we find that the desired root is

λ = 0.0589

The value of this root gives the square of the maximized correlation between the variates assumed to underlie the row and column scores. The correlation coefficient is

$$r_{xy} = \sqrt{\lambda} = \sqrt{0.0589} = .24$$
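The matrices G and T and the latent roots can be obtained with NumPy. This sketch computes all three roots of T numerically with `numpy.linalg.eigvalsh` instead of expanding the characteristic equation by hand, so the second root it reports (about 0.057) need not match the hand-computed 0.0589 exactly; its square root still rounds to .24. Variable names are illustrative.

```python
import numpy as np

# Table IX frequencies (rows: rank; columns: honors/extra-curricular combinations)
N = np.array([[51, 80, 39],
              [112, 100, 23],
              [85, 86, 9]], dtype=float)

row = N.sum(axis=1)                       # n_i. = 170, 235, 180
col = N.sum(axis=0)                       # n_.j = 248, 266, 71
G = N / np.sqrt(np.outer(row, col))       # g_ij = n_ij / sqrt(n_i. n_.j)
T = G @ G.T                               # T = GG'

roots = np.linalg.eigvalsh(T)             # T is symmetric; roots in ascending order
lam = roots[-2]                           # second largest latent root
r = float(np.sqrt(lam))
print(np.round(T, 4))
print(round(float(roots[-1]), 6), round(r, 2))   # 1.0 0.24
```

The largest root is exactly unity because the vector of square roots of the row totals is always an eigenvector of T with eigenvalue 1.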


The scores for rows of the contingency table were calculated by first solving for a set of solutions $b_i$ to the set of simultaneous equations

$$[T - \lambda I]\{b_i\} = 0$$

and then finding a set of scores proportional to the $b_i$'s so as to satisfy the restrictions

$$\sum_i n_{i\cdot}x_i = 0, \qquad \sum_i n_{i\cdot}x_i^2 = n_{\cdot\cdot}$$

To solve for the $b_i$'s we first subtract λ = 0.0589 from the three main-diagonal terms of the matrix T as follows:

$$T - \lambda I = \begin{bmatrix} .3293-.0589 & .3289 & .2761\\ .3289 & .4069-.0589 & .3580\\ .2761 & .3580 & .3227-.0589 \end{bmatrix} = \begin{bmatrix} .2704 & .3289 & .2761\\ .3289 & .3480 & .3580\\ .2761 & .3580 & .2638 \end{bmatrix}$$

Using the elements of the matrix T − λI as coefficients, we write the system of simultaneous linear equations

$$\begin{aligned} .2704\,b_1 + .3289\,b_2 + .2761\,b_3 &= 0\\ .3289\,b_1 + .3480\,b_2 + .3580\,b_3 &= 0\\ .2761\,b_1 + .3580\,b_2 + .2638\,b_3 &= 0 \end{aligned}$$

Solving the equations, we obtain the values of the $b_i$'s:

$$b_1 = 3.731, \qquad b_2 = -1.000, \qquad b_3 = -2.218$$

The set of $b_i$'s was transformed to a set of row scores $x_i$ so as to satisfy the restrictions given previously. The first step in this transformation was to express the three row scores in terms of one of them. The score for the second row, $x_2$, is computationally the easiest to deal with: since we must obtain a set of $x_i$'s proportional to the $b_i$'s, we may express the three row scores in terms of $x_2$ as follows:

$$x_1 = -3.731\,x_2, \qquad x_2 = x_2, \qquad x_3 = 2.218\,x_2$$

Substituting these values for the $x_i$'s in the second restriction equation, we have

$$\sum_i n_{i\cdot}x_i^2 = 170(-3.731)^2x_2^2 + 235\,x_2^2 + 180(2.218)^2x_2^2 = 585$$

from which

$$x_1 = 1.528, \qquad x_2 = -0.410, \qquad x_3 = -0.909$$

Checking to see that the obtained $x_i$'s satisfy the two restrictions, we have

$$\sum_i n_{i\cdot}x_i = 170(1.528) + 235(-0.410) + 180(-0.909) = -0.210$$

and

$$\sum_i n_{i\cdot}x_i^2 = 170(1.528)^2 + 235(-0.410)^2 + 180(-0.909)^2 = 585.148$$

and conclude that both restrictions are satisfied with negligible error.

The scores for columns of the contingency table were calculated from

$$y_j = \frac{\sum_i n_{ij}x_i}{r_{xy}\,n_{\cdot j}}$$

as follows:

$$y_1 = \frac{51(1.528) + 112(-0.410) + 85(-0.909)}{(.24)(248)} = -0.768$$

$$y_2 = \frac{80(1.528) + 100(-0.410) + 86(-0.909)}{(.24)(266)} = 0.049$$

$$y_3 = \frac{39(1.528) + 23(-0.410) + 9(-0.909)}{(.24)(71)} = 2.489$$

Checking to determine whether the obtained $y_j$'s satisfy the two restrictions, we have

$$\sum_j n_{\cdot j}y_j = 248(-0.768) + 266(0.049) + 71(2.489) = -0.711$$

and

$$\sum_j n_{\cdot j}y_j^2 = 248(-0.768)^2 + 266(0.049)^2 + 71(2.489)^2 = 586.8$$

and conclude that both restrictions are satisfied with negligible error.

As a check on the accuracy of the obtained row scores $x_i$ and column scores $y_j$, the numerical values of the scores were substituted in the formula for the maximized correlation coefficient as follows:

$$
\begin{aligned}
r_{xy} = \frac{\sum_i\sum_j n_{ij}x_iy_j}{n_{\cdot\cdot}} = \big[\,&51(1.528)(-0.768) + 80(1.528)(0.049) + 39(1.528)(2.489)\\
&+ 112(-0.410)(-0.768) + 100(-0.410)(0.049) + 23(-0.410)(2.489)\\
&+ 85(-0.909)(-0.768) + 86(-0.909)(0.049) + 9(-0.909)(2.489)\,\big]/585 = .24
\end{aligned}
$$

which checks with the value $r_{xy} = \sqrt{\lambda} = .24$ found earlier.


The significance of the correlation was tested as follows:

$$\chi^2 = n_{\cdot\cdot}\,r_{xy}^2 = 585(.24)^2 = 33.696, \qquad df = 1, \qquad P < .001$$

Therefore, we conclude that the correlation is significant at the .001 level. According to Williams, ‘‘The interpretation of the data is simplest if the association can be accounted for satisfactorily in terms of the correlation between a single pair of variates corresponding to the two attributes.’’ One can readily imagine how the attribute ‘‘rank’’ might have an underlying variate. Such a variate for the other attribute is not so readily conceived, although it is not inconceivable. On the basis of the present data, one can only speculate as to the actual nature of the underlying variates. We can interpret the association as being such that ‘‘low’’ ranking graduates tended not to be honor students and/or not to be participants in extra-curricular activities, while, conversely, ‘‘high’’ ranking graduates tended also to be honor students and/or to be participants in extra-curricular activities.
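Finally, the scoring step can be sketched with the usual normalization of Williams' method: the unit eigenvector u belonging to the second root of T is rescaled to x_i = u_i √(n/n_i.), which satisfies both restrictions exactly, and the column scores follow from y_j = Σ_i n_ij x_i / (r n_.j), the formula used above. This is a sketch under those assumptions rather than a transcription of the hand solution of (T - λI)b = 0, so the resulting scores are close to, but not identical with, the printed ones. The function name `williams_scores` is ours.

```python
import numpy as np


def williams_scores(table):
    """Row scores x, column scores y and the maximized correlation r.

    Scores satisfy sum n_i. x_i = 0 and sum n_i. x_i**2 = n (likewise for y),
    with y_j = sum_i n_ij x_i / (r n_.j) as in the text.
    """
    N = np.asarray(table, dtype=float)
    n = N.sum()
    row, col = N.sum(axis=1), N.sum(axis=0)
    G = N / np.sqrt(np.outer(row, col))
    roots, vectors = np.linalg.eigh(G @ G.T)      # ascending order
    lam, u = roots[-2], vectors[:, -2]            # second largest root, unit vector
    r = float(np.sqrt(lam))
    x = u * np.sqrt(n / row)                      # rescale to meet both restrictions
    if x[0] < 0:                                  # eigenvector sign is arbitrary
        x = -x
    y = (N.T @ x) / (r * col)
    return x, y, r


TABLE_IX = np.array([[51, 80, 39], [112, 100, 23], [85, 86, 9]], dtype=float)
x, y, r = williams_scores(TABLE_IX)
n = TABLE_IX.sum()
n_i, n_j = TABLE_IX.sum(axis=1), TABLE_IX.sum(axis=0)
print(np.round(x, 3), np.round(y, 3))
print(round(r, 2), round(float(x @ TABLE_IX @ y) / n, 2))   # 0.24 0.24
```

The second printed value is the check made in the text: substituting the scores into the correlation formula returns r itself.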


Summary and Implications 





This paper has described and illustrated a number of recent statistical techniques for analyzing association of attributes in contingency tables. Among the techniques are (1) those which permit the testing of hypotheses about regression and correlation of assumed underlying variates in r x s tables in which association is known to exist; (2) those which permit the testing of hypotheses involving three or four attributes simultaneously in an r x s x t x u contingency table; and (3) exact tests of significance for small-sample data in r x s and r x s x t contingency tables.

The illustrative data were taken from an inves- 
tigation of the postgraduate experiences of two 
graduating classes from a large teacher-train- 
ing institution. Attributes of the graduates con- 
sisted of two main kinds: (1) pregraduation at- 
tributes, or those known at time of graduation, 
and (2) postgraduation attributes, or those char- 
acterizing the first five years of teaching exper- 
iences after graduation. 

The statistical techniques presented here are applicable to many educational problems where they are not now being utilized extensively. In the first place, there are the instances where frequencies are reported along with the corresponding percentages but without tests of significance, although generalizations are made from the data. Secondly, the availability of such techniques should suggest the use of previously ignored categorical data. It is recommended that such enumerative data augment rather than supplant continuous data. The techniques lend themselves well to data from surveys and to biographical data. Lastly, much data in the behavioral sciences are of such a nature that they are not immediately quantifiable; this does not mean, however, that they are inherently incapable of quantification. Often, our operations upon the data are of a primitive sort because of the immaturity of the science itself. It is to be hoped that techniques of contingency table analysis will assist in quantifying qualitative data.

It is also to be hoped that the techniques presented here will be given the widest possible dissemination and application to appropriate problems.








REFERENCES 


1. Bartlett, M. S. ‘‘Contingency Table Interactions,’’ Supplement to the Journal of the Royal Statistical Society, II (1935), pp. 248-252.

2. Cochran, W. G. ‘‘Some Methods for Strengthening the Common χ² Tests,’’ Biometrics, X (1954), pp. 417-451.

3. Doi, A. N. Test for Independence of Four Classifications of a Complex Contingency Table by the Method of the Likelihood Ratio, unpublished M.A. colloquium paper, University of Minnesota, 1952, 44 pp.

4. Fisher, R. A. Statistical Methods for Research Workers, 10th Ed., Revised (New York: Hafner Publishing Co., 1946), 354 pp.

5. Freeman, G. H. and Halton, J. H. ‘‘Note on an Exact Treatment of Contingency, Goodness of Fit and Other Problems of Significance,’’ Biometrika, XXXVIII (1951), pp. 141-149.

6. Gulliksen, H. Theory of Mental Tests (New York: John Wiley and Sons, 1950), 486 pp.

7. Mayo, S. T. Some Designs for the Collection of and Methods for the Analysis of Enumerative Data Characteristic of Attributes, with Special Applications to a Follow-up of Education Graduates. Unpublished Ph.D. Thesis (University Microfilm: Mic 56-2926), University of Minnesota, 1956, 283 pp.

8. Mood, A. McF. Introduction to the Theory of Statistics (New York: McGraw-Hill Book Co., 1950), 433 pp.

9. Norton, H. W. ‘‘Calculation of Chi-square for Complex Contingency Tables,’’ Journal of the American Statistical Association, XL (1945), pp. 251-258.

10. Snedecor, G. W. Statistical Methods (Cedar Falls, Iowa: Iowa State College Press, 1946), 485 pp.

11. Stuart, A. ‘‘The Estimation and Comparison of Strengths of Association in Contingency Tables,’’ Biometrika, XL (1953), pp. 105-110.

12. Williams, E. J. ‘‘Use of Scores for the Analysis of Association in Contingency Tables,’’ Biometrika, XXXIX (1952), pp. 274-289.

13. Yates, F. ‘‘The Analysis of Contingency Tables with Groupings Based on Quantitative Characters,’’ Biometrika, XXXV (1948), pp. 176-181.





THE BEST LINEAR ESTIMATE OF THE PREDICTED VALUE AND THE STANDARD ERROR OF THE ESTIMATE


PALMER O. JOHNSON 
University of Minnesota 


DAVID AND NEYMAN* used the Markoff Theorem to derive a general method for obtaining the best linear estimate of a criterion and the standard error of that estimate. To illustrate the theorem, they gave the case of one independent and one dependent variable, say X and Y. In that case the calculations are easy, even with raw scores. For more than one independent variable, say $X_1, X_2, \ldots, X_s$, we suggest using deviation scores, which saves much computing work in deriving formulas for the problem considered here. Our purpose is to obtain a general form for the estimation of Y from a fixed set of more than one independent variable, i.e., from $X_1, X_2, \ldots, X_s$, and to determine the standard error of such an estimate.

There are two situations that need to be dif- 
ferentiated with respect to the predicted value: 


1. The estimate of the true mean Y score, i.e., the average of all predicted Y's, or $E(Y)$, of all individuals in the population who have fixed, identical values of $X_1, X_2, \ldots, X_s$.

2. The true Y score of any single individual for whom a Y score is predicted.


It is situation (1) with which this paper deals. 

We first summarize the notation and formulas used by David and Neyman. Several changes have been made to fit our problem, without detracting from the original meanings.


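Define

$$
\begin{aligned}
Y_i &= \text{the } Y \text{ score of the } i\text{th individual} \\
X_{1i} &= \text{the } X_1 \text{ score of the } i\text{th individual} \\
&\;\;\vdots \\
X_{si} &= \text{the } X_s \text{ score of the } i\text{th individual}
\end{aligned}
\tag{1}
$$

where $i = 1, 2, \ldots, n$ and $n$ denotes the number of individuals. Assume also that $X_1, X_2, \ldots, X_s$ are independent variables and that the relationship between Y and each X is linear.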


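Define again,

$$
y_i = Y_i - \bar Y, \qquad x_{ji} = X_{ji} - \bar X_j \quad (j = 1, 2, \ldots, s)
\tag{2}
$$

where $y_i, x_{1i}, x_{2i}, \ldots, x_{si}$ are the deviation scores of $Y, X_1, X_2, \ldots, X_s$, respectively, for the $i$th individual, and $\bar Y, \bar X_1, \bar X_2, \ldots, \bar X_s$ are the mean scores of $Y, X_1, X_2, \ldots, X_s$, respectively. Under these conditions we may write

$$
\hat y_i = E(y_i) = a_{i1}p_1 + \cdots + a_{is}p_s + a_{i(s+1)}p_{s+1}
\tag{3}
$$

where $\hat y_i$, or $E(y_i)$, denotes the mathematical expectation of $y_i$, and $p_1, \ldots, p_{s+1}$ are to be considered unknown parameters. For our special case,

$$
a_{i1} = 1, \qquad a_{i2} = x_{1i}, \qquad \ldots, \qquad a_{i(s+1)} = x_{si}
\tag{4}
$$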


By using the Markoff Theorem, we obtain

$$
\hat y = -\frac{A_\theta}{A}
\tag{5}
$$

$$
\sigma^2_{\hat y} = -\frac{A_0\,A_b}{(n-s-1)\,A^2}
\tag{6}
$$

where $\sigma^2_{\hat y}$ denotes the square of the standard error of the estimate, $n$ denotes the number of individuals involved, and $s$ is the number of independent variables. The terms $A$, $A_\theta$, $A_0$, and $A_b$ are the determinants defined below.


* F. N. David and J. Neyman, "Extension of the Markoff Theorem on Least Squares," Statistical Research Memoirs, II (December 1938), pp. 105-106.










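$$
A = \begin{vmatrix}
G_{11} & G_{12} & \cdots & G_{1(s+1)} \\
G_{21} & G_{22} & \cdots & G_{2(s+1)} \\
\vdots & \vdots & & \vdots \\
G_{(s+1)1} & G_{(s+1)2} & \cdots & G_{(s+1)(s+1)}
\end{vmatrix}
\tag{7}
$$

where

$$
G_{jk} = \sum_{i=1}^{n} p_i\, a_{ij}\, a_{ik}
\tag{8}
$$

For our problem, $p_i = 1$.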


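$$
A_\theta = \begin{vmatrix}
0 & b_1 & b_2 & \cdots & b_{s+1} \\
H_1 & G_{11} & G_{12} & \cdots & G_{1(s+1)} \\
H_2 & G_{21} & G_{22} & \cdots & G_{2(s+1)} \\
\vdots & \vdots & \vdots & & \vdots \\
H_{s+1} & G_{(s+1)1} & G_{(s+1)2} & \cdots & G_{(s+1)(s+1)}
\end{vmatrix}
\tag{10}
$$

where

$$
H_j = \sum_{i=1}^{n} p_i\, y_i\, a_{ij}
\tag{11}
$$

For our problem, $p_i = 1$. The $b$'s are the coefficients of the quantity being estimated: $b_1 = 1$ and $b_{j+1} = x_j$ $(j = 1, \ldots, s)$, where $x_j = X_j - \bar X_j$ is the deviation of the given value of $X_j$ from its mean.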







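$$
A_0 = \begin{vmatrix}
H_0 & H_1 & H_2 & \cdots & H_{s+1} \\
H_1 & G_{11} & G_{12} & \cdots & G_{1(s+1)} \\
H_2 & G_{21} & G_{22} & \cdots & G_{2(s+1)} \\
\vdots & \vdots & \vdots & & \vdots \\
H_{s+1} & G_{(s+1)1} & G_{(s+1)2} & \cdots & G_{(s+1)(s+1)}
\end{vmatrix}
\tag{14}
$$

where

$$
H_0 = \sum_{i=1}^{n} p_i\, y_i^2
\tag{15}
$$

For our problem, $p_i = 1$. All the other notation is the same as in (7) and (10).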


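$$
A_b = \begin{vmatrix}
0 & b_1 & b_2 & \cdots & b_{s+1} \\
b_1 & G_{11} & G_{12} & \cdots & G_{1(s+1)} \\
b_2 & G_{21} & G_{22} & \cdots & G_{2(s+1)} \\
\vdots & \vdots & \vdots & & \vdots \\
b_{s+1} & G_{(s+1)1} & G_{(s+1)2} & \cdots & G_{(s+1)(s+1)}
\end{vmatrix}
\tag{17}
$$

where all the notation is the same as in (7) and (10).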
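The determinant recipe in (5) through (17) can be checked numerically. The following is a minimal pure-Python sketch and not the authors' own procedure. It assumes unit weights ($p_i = 1$), the intercept-plus-deviations coding of (4), and $b = (1, x_1, \ldots, x_s)$ for a fixed set of predictor values. The helper names `det` and `markoff_estimate` are hypothetical.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in m]
    size = len(a)
    result = 1.0
    for c in range(size):
        p = max(range(c, size), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            result = -result              # a row swap flips the sign
        result *= a[c][c]
        for r in range(c + 1, size):
            f = a[r][c] / a[c][c]
            for k in range(c, size):
                a[r][k] -= f * a[c][k]
    return result


def markoff_estimate(X, Y, x0):
    """Estimate E(Y) at predictor values x0 and the squared standard error.

    X: n rows of s predictor scores; Y: n criterion scores; x0: s fixed values.
    """
    n, s = len(Y), len(X[0])
    m = s + 1
    ybar = sum(Y) / n
    xbar = [sum(row[j] for row in X) / n for j in range(s)]
    y = [v - ybar for v in Y]                                        # deviations, (2)
    a = [[1.0] + [row[j] - xbar[j] for j in range(s)] for row in X]  # coding of (4)
    G = [[sum(a[i][j] * a[i][k] for i in range(n)) for k in range(m)]
         for j in range(m)]                                          # (8), p_i = 1
    H = [sum(y[i] * a[i][j] for i in range(n)) for j in range(m)]    # (11)
    H0 = sum(v * v for v in y)                                       # (15)
    b = [1.0] + [x0[j] - xbar[j] for j in range(s)]                  # assumed b vector

    def bordered(corner, top, left):
        # first row (corner, top...), first column (corner, left...), then G
        return [[corner] + list(top)] + [[left[j]] + G[j] for j in range(m)]

    A = det(G)                                # (7)
    A_theta = det(bordered(0.0, b, H))        # (10)
    A_0 = det(bordered(H0, H, H))             # (14)
    A_b = det(bordered(0.0, b, b))            # (17)
    y_hat = -A_theta / A                      # (5)
    var = -A_0 * A_b / ((n - s - 1) * A * A)  # (6)
    return ybar + y_hat, var
```

Take five hypothetical pairs, X = 1, ..., 5 and Y = 2, 4, 5, 4, 5. Then `markoff_estimate([[1], [2], [3], [4], [5]], [2, 4, 5, 4, 5], [4])` gives an estimate of 4.6 and a squared standard error of 0.24, up to floating-point rounding. Both figures can be confirmed by hand from an ordinary least-squares fit.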


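Before we start to work out our problem, we give some elementary formulas that will be useful in writing our solution in more general form:

$$
\begin{aligned}
r_{1y.2} &= \frac{r_{1y} - r_{12}\,r_{2y}}{\sqrt{1-r_{12}^2}\,\sqrt{1-r_{2y}^2}}, \qquad
r_{1y.23} = \frac{r_{1y.3} - r_{12.3}\,r_{2y.3}}{\sqrt{1-r_{12.3}^2}\,\sqrt{1-r_{2y.3}^2}} \\
1 - R_{1.23}^2 &= (1 - r_{12}^2)(1 - r_{13.2}^2) \\
1 - R_{y.12}^2 &= (1 - r_{1y}^2)(1 - r_{2y.1}^2) \\
1 - R_{y.123}^2 &= (1 - r_{1y}^2)(1 - r_{2y.1}^2)(1 - r_{3y.12}^2) \\
\sigma_{y.12}^2 &= \sigma_y^2(1 - R_{y.12}^2), \qquad
\sigma_{1.23}^2 = \sigma_1^2(1 - R_{1.23}^2), \qquad
\sigma_{y.123}^2 = \sigma_y^2(1 - R_{y.123}^2)
\end{aligned}
\tag{18}
$$

where 1, 2, 3 denote $X_1, X_2, X_3$; analogous expressions hold for other combinations of subscripts.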
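The product formulas in (18) can be checked numerically against the zero-order expression that appears later in (28). The sketch below uses hypothetical correlations and illustrative helper names. It computes $1 - R^2_{y.12}$ both ways.

```python
import math


def partial_r(r_ab, r_ac, r_bc):
    """First-order partial correlation r_{ab.c}."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def one_minus_R2_product(r1y, r2y, r12):
    """1 - R^2_{y.12} = (1 - r_1y^2)(1 - r_{2y.1}^2), as in (18)."""
    r2y_1 = partial_r(r2y, r12, r1y)      # correlate 2 with y, holding 1 constant
    return (1 - r1y ** 2) * (1 - r2y_1 ** 2)


def one_minus_R2_direct(r1y, r2y, r12):
    """The same quantity written out in zero-order correlations."""
    num = 1 - r12 ** 2 - r1y ** 2 - r2y ** 2 + 2 * r12 * r1y * r2y
    return num / (1 - r12 ** 2)


# Hypothetical correlations, chosen only for illustration.
r1y, r2y, r12 = 0.5, 0.4, 0.3
print(round(one_minus_R2_product(r1y, r2y, r12), 6))   # 0.681319
print(round(one_minus_R2_direct(r1y, r2y, r12), 6))    # 0.681319
```

The two routes agree algebraically. The product form is the one used when the results are rewritten in terms of partial correlations.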
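Now we consider the estimate of Y from only one variable X. By (1), (2), (8), (11), and (15), we have

$$
\begin{aligned}
&G_{11} = n, \qquad G_{12} = G_{21} = \sum x = 0, \qquad G_{22} = \sum x^2 = n\sigma_x^2, \\
&H_1 = \sum y = 0, \qquad H_2 = \sum xy = n\,r_{xy}\sigma_x\sigma_y, \qquad H_0 = \sum y^2 = n\sigma_y^2
\end{aligned}
\tag{19}
$$

From (7), (10), (14), and (17), with $b_1 = 1$ and $b_2 = x = X - \bar X$, we have

$$
A = n^2\sigma_x^2, \qquad A_\theta = -n^2 r_{xy}\sigma_x\sigma_y\,x, \qquad A_0 = n^3\sigma_x^2\sigma_y^2(1 - r_{xy}^2), \qquad A_b = -n(\sigma_x^2 + x^2)
$$

From (5) we obtain

$$
\hat y = r_{xy}\frac{\sigma_y}{\sigma_x}\,x
$$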
x 


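If we want to estimate Y from X, we simply use definition (2) and easily get

$$
\hat Y = \bar Y + r_{xy}\frac{\sigma_y}{\sigma_x}(X - \bar X)
$$

From (6) we obtain*

$$
\sigma^2_{\hat y} = \frac{\sigma_y^2(1 - r_{xy}^2)}{n-2}\left[1 + \frac{(X - \bar X)^2}{\sigma_x^2}\right]
\qquad\text{or}\qquad
\sigma^2_{\hat y} = \frac{\sigma^2_{y.x}}{n-2}\left[1 + \frac{(X - \bar X)^2}{\sigma_x^2}\right]
$$

where $\sigma^2_{y.x} = \sigma_y^2(1 - r_{xy}^2)$.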
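The one-variable result can be evaluated directly from raw scores. The sketch below assumes divisor-n standard deviations and correlation, which matches $\sum x^2 = n\sigma_x^2$ in (19). The function name and the data are hypothetical.

```python
import math


def simple_estimate(X, Y, x0):
    """Y-hat at X = x0 and the squared standard error of that mean estimate.

    Uses divisor-n standard deviations and r, matching sum(x^2) = n*sigma_x^2 in (19).
    """
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in X) / n)
    sy = math.sqrt(sum((v - my) ** 2 for v in Y) / n)
    r = sum((a - mx) * (b - my) for a, b in zip(X, Y)) / (n * sx * sy)
    y_hat = my + r * sy / sx * (x0 - mx)
    var = sy ** 2 * (1 - r ** 2) / (n - 2) * (1 + (x0 - mx) ** 2 / sx ** 2)
    return y_hat, var


X, Y = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]   # hypothetical scores
for x0 in (3, 4, 5):
    y_hat, var = simple_estimate(X, Y, x0)
    print(x0, round(y_hat, 3), round(var, 3), round(math.sqrt(var), 3))
```

For these scores the squared standard error is smallest at the mean of X: it is 0.16 at X = 3 and rises to 0.48 at X = 5. This is the effect of the factor $1 + (X - \bar X)^2/\sigma_x^2$.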
Next we consider the estimate of Y from two independent variables, $X_1$ and $X_2$.


* By elementary statistical reasoning it is not difficult to show that $\sigma^2_{\hat y} = \sigma^2_{\hat Y}$. Hence this formula can also be used to evaluate the standard error of estimate for the Y score. The same holds in the similar situations that arise later in this paper.
















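Following the same procedures, we have**

$$
A = \begin{vmatrix}
n & 0 & 0 \\
0 & n\sigma_1^2 & n r_{12}\sigma_1\sigma_2 \\
0 & n r_{12}\sigma_1\sigma_2 & n\sigma_2^2
\end{vmatrix}
= n^3\sigma_1^2\sigma_2^2(1 - r_{12}^2)
$$

$$
A_\theta = \begin{vmatrix}
0 & 1 & x_1 & x_2 \\
0 & n & 0 & 0 \\
n r_{1y}\sigma_1\sigma_y & 0 & n\sigma_1^2 & n r_{12}\sigma_1\sigma_2 \\
n r_{2y}\sigma_2\sigma_y & 0 & n r_{12}\sigma_1\sigma_2 & n\sigma_2^2
\end{vmatrix}
= -n^3\sigma_1\sigma_2\sigma_y\left[\sigma_2 x_1(r_{1y} - r_{12}r_{2y}) + \sigma_1 x_2(r_{2y} - r_{12}r_{1y})\right]
\tag{27}
$$

$$
A_0 = \begin{vmatrix}
n\sigma_y^2 & 0 & n r_{1y}\sigma_1\sigma_y & n r_{2y}\sigma_2\sigma_y \\
0 & n & 0 & 0 \\
n r_{1y}\sigma_1\sigma_y & 0 & n\sigma_1^2 & n r_{12}\sigma_1\sigma_2 \\
n r_{2y}\sigma_2\sigma_y & 0 & n r_{12}\sigma_1\sigma_2 & n\sigma_2^2
\end{vmatrix}
= n^4\sigma_1^2\sigma_2^2\sigma_y^2\left\{(1 - r_{12}^2) - r_{1y}^2 - r_{2y}^2 + 2r_{12}r_{1y}r_{2y}\right\}
\tag{28}
$$

$$
A_b = \begin{vmatrix}
0 & 1 & x_1 & x_2 \\
1 & n & 0 & 0 \\
x_1 & 0 & n\sigma_1^2 & n r_{12}\sigma_1\sigma_2 \\
x_2 & 0 & n r_{12}\sigma_1\sigma_2 & n\sigma_2^2
\end{vmatrix}
= -n^2\left\{\sigma_1^2\sigma_2^2(1 - r_{12}^2) + \sigma_2^2x_1^2 + \sigma_1^2x_2^2 - 2r_{12}\sigma_1\sigma_2x_1x_2\right\}
\tag{29}
$$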
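Substituting the values of $A$, $A_\theta$, $A_0$, and $A_b$ into (5) and (6), and using the formulas in (18), we obtain

$$
\hat y = r_{1y.2}\frac{\sigma_{y.2}}{\sigma_{1.2}}\,x_1 + r_{2y.1}\frac{\sigma_{y.1}}{\sigma_{2.1}}\,x_2
$$

or

$$
\hat Y = \bar Y + r_{1y.2}\frac{\sigma_{y.2}}{\sigma_{1.2}}(X_1 - \bar X_1) + r_{2y.1}\frac{\sigma_{y.1}}{\sigma_{2.1}}(X_2 - \bar X_2)
$$

$$
\sigma^2_{\hat y} = \frac{\sigma_y^2(1 - R_{y.12}^2)}{n-3}\left[1 + \frac{1}{1 - r_{12}^2}\left(\frac{(X_1 - \bar X_1)^2}{\sigma_1^2} + \frac{(X_2 - \bar X_2)^2}{\sigma_2^2} - 2r_{12}\frac{(X_1 - \bar X_1)(X_2 - \bar X_2)}{\sigma_1\sigma_2}\right)\right]
$$

or

$$
\sigma^2_{\hat y} = \frac{\sigma_{y.12}^2}{n-3}\left[1 + \frac{(X_1 - \bar X_1)^2}{\sigma_{1.2}^2} + \frac{(X_2 - \bar X_2)^2}{\sigma_{2.1}^2} - 2r_{12}\frac{(X_1 - \bar X_1)(X_2 - \bar X_2)}{\sigma_{1.2}\,\sigma_{2.1}}\right]
$$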
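The two-variable formulas need only summary statistics, so they can be scripted directly. The sketch below is hypothetical in its function name and in the dictionary keys of its input. It assumes divisor-n moments as in (19) and follows the partial-correlation form of $\hat Y$ and the second form of $\sigma^2_{\hat y}$.

```python
import math


def two_predictor_estimate(stats, X1, X2, n):
    """Y-hat and sigma^2 of the estimate from summary statistics (divisor n).

    stats: dict with hypothetical keys ybar, m1, m2, sy, s1, s2, r1y, r2y, r12.
    """
    g = stats
    x1, x2 = X1 - g["m1"], X2 - g["m2"]
    r12, r1y, r2y = g["r12"], g["r1y"], g["r2y"]
    # partial correlations and residual standard deviations, as in (18)
    r1y_2 = (r1y - r12 * r2y) / math.sqrt((1 - r12 ** 2) * (1 - r2y ** 2))
    r2y_1 = (r2y - r12 * r1y) / math.sqrt((1 - r12 ** 2) * (1 - r1y ** 2))
    sy_2 = g["sy"] * math.sqrt(1 - r2y ** 2)
    sy_1 = g["sy"] * math.sqrt(1 - r1y ** 2)
    s1_2 = g["s1"] * math.sqrt(1 - r12 ** 2)
    s2_1 = g["s2"] * math.sqrt(1 - r12 ** 2)
    y_hat = g["ybar"] + r1y_2 * sy_2 / s1_2 * x1 + r2y_1 * sy_1 / s2_1 * x2
    one_minus_R2 = (1 - r12 ** 2 - r1y ** 2 - r2y ** 2
                    + 2 * r12 * r1y * r2y) / (1 - r12 ** 2)
    sy_12_sq = g["sy"] ** 2 * one_minus_R2            # sigma^2_{y.12}
    bracket = (1 + x1 ** 2 / s1_2 ** 2 + x2 ** 2 / s2_1 ** 2
               - 2 * r12 * x1 * x2 / (s1_2 * s2_1))
    return y_hat, sy_12_sq / (n - 3) * bracket


stats = dict(ybar=50, m1=20, m2=30, sy=10, s1=4, s2=5, r1y=0.6, r2y=0.5, r12=0.3)
print(two_predictor_estimate(stats, 24, 25, 103))
```

The coefficient $r_{1y.2}\,\sigma_{y.2}/\sigma_{1.2}$ simplifies to $\sigma_y(r_{1y} - r_{12}r_{2y})/[\sigma_1(1 - r_{12}^2)]$. That simplification gives an easy hand check of the first return value.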


Then we consider the estimate of Y from three independent variables, $X_1$, $X_2$, $X_3$.


** For simplicity, we use the subscripts 1, 2, 3, ..., s to denote $x_1, x_2, x_3, \ldots, x_s$ in writing r and σ. This holds throughout the paper.







Following the method used before, we have 


0 
no* 
A 


NY 429,92 


NF 439,95 


0 1 
0 n 
NT, yi 0y 0 
NT 2 yI2%y 0 


NE 3 yI30y 0 





= -n*0,0205 [ 0205x, {r,y(1-r§,) -Fay(Tiz-lislas) - Pyy(Fis-Tizlas) } 


2 
ndy 0 
0 n 
AF, v7, Oy 0 


NT 2 yI29y 0 





AEs yIs0y 0 


-_ =F. 
=n* 0} 030505 


0 
NF 429,92 
nos 


NP a39295 


xy 

0 

no? 

NP 429,92 


NT 439,95 


+0, 05X2f 


0 
waren i 2 
= n'0,9205 [ l-r2.-rfy-rig+2ry2lisP 2s] 


NP a39293 


no4q 





Xe Xy 
0 0 
NF 429,92 NF 430,03 


no3 NT 939203 





2 


ray(1-ris) - Piy(lia-lislas) - Py y(T2s-Tizlis)f 


+ 0, 0gX54 ryy(1-ri2) . Pyy(Tis-P izle) * Pay(l2s-Ti2lis)} 


NT, yO, 0y 


0 

not 

NP 429,92 
P4399 


(1-r3,-r 


+ 2r,yP2 





LL. 


2 2 2 
is“Tay + 2Py2T 3823) 


os riy(1-r3s) ° riy (1-r?,) - roy (1-r7.) 


NT 2 yF20y 
0 0 


NPs yI39y 
NP 429,02 NT ,30,05 

2 
NnOe 


Nl a392 Oy; 


Nf23920, nos 





y(Tiz-Tis las) + 2r, yay(lis-Pi2las) 





+ 2raylay(las-Tizlis)| 





x, 

0 

no? 

NF 129,02 


NT 439,05 


Xe 

0 

NF 429,92 
nos 


NT 939205 


Xy 
0 
NI 439,03 
NI 239205 


2 
no; 





(continued next page) 








March, 1957) JOHNSON 


a 3 2 ’ 5 2.3.2 2 2.2.3 2 
= (1-Pya-Tis-Tas + 20 yeTislas) + 0203X,(1-rg3) + 0, 03XQ(1-rj5) 


222 2 2 
+ OF03x5(1-riz) - 20,0205%,X2(Fig-Tislas) - 20,0305X,X5 (yy -T yal as) 


2 
~ 2050205X2X3(Fa3-Ti2l is) 


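After more troublesome algebraic manipulations, we obtain

$$
\hat y = r_{1y.23}\frac{\sigma_{y.23}}{\sigma_{1.23}}\,x_1 + r_{2y.13}\frac{\sigma_{y.13}}{\sigma_{2.13}}\,x_2 + r_{3y.12}\frac{\sigma_{y.12}}{\sigma_{3.12}}\,x_3
$$

or

$$
\hat Y = \bar Y + r_{1y.23}\frac{\sigma_{y.23}}{\sigma_{1.23}}(X_1 - \bar X_1) + r_{2y.13}\frac{\sigma_{y.13}}{\sigma_{2.13}}(X_2 - \bar X_2) + r_{3y.12}\frac{\sigma_{y.12}}{\sigma_{3.12}}(X_3 - \bar X_3)
$$

$$
\begin{aligned}
\sigma^2_{\hat y} = \frac{\sigma_y^2(1 - R_{y.123}^2)}{n-4}\Bigg[1 + \frac{1}{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}}\Bigg\{
&(1 - r_{23}^2)\frac{(X_1 - \bar X_1)^2}{\sigma_1^2} + (1 - r_{13}^2)\frac{(X_2 - \bar X_2)^2}{\sigma_2^2} + (1 - r_{12}^2)\frac{(X_3 - \bar X_3)^2}{\sigma_3^2} \\
&- \frac{2(r_{12} - r_{13}r_{23})}{\sigma_1\sigma_2}(X_1 - \bar X_1)(X_2 - \bar X_2)
- \frac{2(r_{13} - r_{12}r_{23})}{\sigma_1\sigma_3}(X_1 - \bar X_1)(X_3 - \bar X_3) \\
&- \frac{2(r_{23} - r_{12}r_{13})}{\sigma_2\sigma_3}(X_2 - \bar X_2)(X_3 - \bar X_3)\Bigg\}\Bigg]
\end{aligned}
$$

or

$$
\begin{aligned}
\sigma^2_{\hat y} = \frac{\sigma_{y.123}^2}{n-4}\Bigg[1
&+ \frac{(X_1 - \bar X_1)^2}{\sigma_{1.23}^2} + \frac{(X_2 - \bar X_2)^2}{\sigma_{2.13}^2} + \frac{(X_3 - \bar X_3)^2}{\sigma_{3.12}^2}
- 2r_{12.3}\frac{(X_1 - \bar X_1)(X_2 - \bar X_2)}{\sigma_{1.23}\,\sigma_{2.13}} \\
&- 2r_{13.2}\frac{(X_1 - \bar X_1)(X_3 - \bar X_3)}{\sigma_{1.23}\,\sigma_{3.12}}
- 2r_{23.1}\frac{(X_2 - \bar X_2)(X_3 - \bar X_3)}{\sigma_{2.13}\,\sigma_{3.12}}\Bigg]
\end{aligned}
$$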











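From all the above, we can finally derive the formulas for $\hat Y$ and $\sigma^2_{\hat y}$ in general form:

$$
\hat Y = \bar Y + \sum_{i=1}^{s} r_{iy.jkl\ldots}\,\frac{\sigma_{y.jkl\ldots}}{\sigma_{i.jkl\ldots}}\,(X_i - \bar X_i)
$$

$$
\sigma^2_{\hat y} = \frac{\sigma^2_{y.12\ldots s}}{n-s-1}\left[1 + \sum_{i=1}^{s}\frac{(X_i - \bar X_i)^2}{\sigma^2_{i.jkl\ldots}} - 2\sum_{i<j} r_{ij.kl\ldots}\,\frac{(X_i - \bar X_i)(X_j - \bar X_j)}{\sigma_{i.jkl\ldots}\;\sigma_{j.ikl\ldots}}\right]
$$

In each term the secondary subscripts (those after the point) run over all the remaining independent variables, with $i \ne j \ne k \ne l \ne \cdots$. For r, $i < j$ and $k < l < \cdots$; for σ, $j < k < l < \cdots$. Here $i, j, k, l = 1, 2, \ldots, s$; n denotes the number of individuals and s the number of independent variables.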
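The general formulas can be cross-checked in matrix form. The sketch below uses the hypothetical helpers `inverse` and `general_estimate`, in pure Python with divisor-n moments as in (19). It relies on a standard identity that we substitute for the author's determinant algebra. The diagonal of the inverse predictor covariance matrix gives $1/\sigma^2_{i.jkl\ldots}$. Its off-diagonal elements give $-r_{ij.kl\ldots}/(\sigma_{i.jkl\ldots}\sigma_{j.ikl\ldots})$.

```python
import math


def inverse(m):
    """Gauss-Jordan inverse of a small nonsingular matrix (partial pivoting)."""
    size = len(m)
    a = [list(row) + [1.0 if i == j else 0.0 for j in range(size)]
         for i, row in enumerate(m)]
    for c in range(size):
        p = max(range(c, size), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        piv = a[c][c]
        a[c] = [v / piv for v in a[c]]
        for r in range(size):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [row[size:] for row in a]


def general_estimate(X, Y, x0):
    """Y-hat at predictor values x0 and sigma^2 of the estimate, general s."""
    n, s = len(Y), len(X[0])
    ybar = sum(Y) / n
    xbar = [sum(row[j] for row in X) / n for j in range(s)]
    dev = [[row[j] - xbar[j] for j in range(s)] for row in X]
    y = [v - ybar for v in Y]
    C = [[sum(d[j] * d[k] for d in dev) / n for k in range(s)] for j in range(s)]
    c_xy = [sum(d[j] * yi for d, yi in zip(dev, y)) / n for j in range(s)]
    P = inverse(C)
    # beta_j equals r_{jy.rest} * sigma_{y.rest} / sigma_{j.rest}
    beta = [sum(P[j][k] * c_xy[k] for k in range(s)) for j in range(s)]
    var_y_rest = sum(v * v for v in y) / n - sum(b * c for b, c in zip(beta, c_xy))
    sig = [1 / math.sqrt(P[i][i]) for i in range(s)]      # sigma_{i.jkl...}
    x = [x0[j] - xbar[j] for j in range(s)]
    bracket = 1 + sum((x[i] / sig[i]) ** 2 for i in range(s))
    for i in range(s):
        for j in range(i + 1, s):
            r_ij = -P[i][j] * sig[i] * sig[j]               # r_{ij.kl...}
            bracket -= 2 * r_ij * x[i] * x[j] / (sig[i] * sig[j])
    y_hat = ybar + sum(b * xi for b, xi in zip(beta, x))
    return y_hat, var_y_rest / (n - s - 1) * bracket
```

With s = 1 the function reduces to the one-variable result. At $X_i = \bar X_i$ for every i, the bracket equals 1 and $\hat Y = \bar Y$, so the standard error is at its minimum $\sigma_{y.12\ldots s}/\sqrt{n-s-1}$.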








THE "EQUATING" OF NON-PARALLEL TESTS*


WILLIAM H. ANGOFF 
Educational Testing Service 
Princeton, New Jersey 


THE PURPOSE of this paper is to explore the feasibility of establishing a unique conversion table for translating scores on one test to scores on another test of different psychological function. In particular, this problem will be examined with reference to the Scholastic Aptitude Test of the College Board, referred to here as the SAT, and the American Council on Education Psychological Examination, referred to here as the ACE. The first portion of this paper will attempt to define the sources of error in such a table and will draw some conclusions regarding their implications for a table of comparable scores. The second portion will present some empirical evidence regarding some difficulties in setting up such a table.

In discussing the implications of developing 
tables of comparable scores, it will be profit- 
able to draw the distinction between parallel and 
non-parallel tests. In the conversion of scores 
from one form to a parallel form of a test, such
as one form of the ACE to another, or one form 
of the SAT to another, there is simply the prob- 
lem of transforming the system of units. In such 
a case, we can consider the problem as directly 
analogous to the problem of conversion from 
centimeters to inches, from Centigrade to Fahr- 
enheit, from pounds to grams, etc. Since the 
two kinds of measures in each case involve 
identical functions, the system of conversion is 
unique; there is one and only one conversion 
equation. On the other hand, the question of con- 
verting scores from different tests is another 
matter. Here the problems include those that 
are involved in setting up a table of equivalent 
scores for tests of identical function, and in ad- 
dition include the difficulties resulting from lack 
of equivalence of function of the two tests in ques- 
tion. 

It is possible to outline briefly the following 
sources of random error in a table of equivalent 
scores for tests of identical function. One 
source of error lies in the unreliability of the 
measuring instruments themselves, and affects 
the stability of the statistics obtained in the col- 
lection of data. A second source of error lies 
in the choice of samples used to establish the 
conversion line. Different samples will yield 
lines that differ randomly from one another. A 
third source of error in equivalent scores lies 
in the design of the equating experiment and the
method of treating the data. The various methods
of equating are not equally reliable.

In the problem of equivalent scores that are 
derived from parallel forms of a test, there is 
by definition a unique ‘‘true’’ line. Errors re- 
sulting from the unreliability of tests, from the 
sampling of people, and from design and method
will only be random errors. The problem of
‘‘comparable’’ scores, on the other hand, that 
is, of converting scores from different tests, is 
more complex. Here, the psychological func- 
tions measured are different, so that the prob- 
lem is not simply one of transforming units. 
To use the physical analogy again: the problem 
essentially calls for a conversion from height 
to weight. However, the statement of this prob- 
lem rests on an inappropriate premise, that 
there is a single conversion system —and this is 
obviously not true. While it is certainly true 
that one can predict or estimate by regression 
equations the most likely weight of an individual 
of a given height, or predict scores on the SAT 
from knowledge of scores on the ACE, it is clear
that in neither case is the prediction to be con- 
sidered a transformation of unit systems. Fur- 
thermore, just as in the case of regression lines, 
conversion lines which purport to translate 
scores across tests of different function will, in 
fact, differ systematically, that is, they will dif- 
fer in a predictable fashion, depending on cer- 
tain crucial considerations. Some of these con- 
siderations can be outlined as follows: 


1. Different lines will result when the meth- 
odological definitions of comparability are dif- 
ferent; and such definitions must in any instance 
be purely arbitrary. For example, one might 
ask what the most likely score would be on y, 
given a particular score on x. This would be 
one definition. Or one might equate standard-score
deviates on the two tests, or equivalent
percentiles. These would be other definitions.
Or one might, for example, equate scores that
yielded equal predicted scores on one or another
criterion. In this instance, of course, the use
of different prediction criteria would yield different
equations. There is probably an unlimited
number of other definitions, each yielding a
different conversion line. In contrast, all methods
or definitions of equivalent scores in the
case of parallel forms lead to the same conversion
equation, except for random error.


*This paper was presented at the 195 meetings of the American Psychological Association. 







2. Different lines will result when different populations form the basis for deriving the tables of comparable scores. The line of relationship between two tests of different function will be one line for, say, male engineering students, and quite another for, say, female arts students. This is immediately apparent from the mean scores of boys and girls on the Verbal and Mathematical Sections of the SAT. Girls characteristically score slightly higher than boys on Verbal, but considerably lower than boys on Mathematical. The "comparability" line (or conversion line) between Verbal and Mathematical scores derived from male data would therefore be much different from that derived from female data. Conversion lines for tests of different function must, in fact, differ between groups that show different types of profiles on the tests. The conversion lines are themselves a way of examining profiles on two tests, and they are therefore necessarily normative in character.

3. Differential selection will have a pronounced effect on conversion lines. Suppose, for example, that we collect data on the SAT and the ACE for a group of individuals in order to set up a table converting scores on one of these tests to scores on the other. Suppose further that these individuals have been explicitly selected on the SAT. The distribution of resultant SAT scores will, of course, be sharply curtailed. The distribution of ACE scores, on the other hand, will be only moderately altered. A conversion line derived from such data will be far different from a line derived from a group of unselected students. In fact, even if the students used for data collection have been selected on an outside variable, say high school grades, their resultant distributions on the two scores in question will be affected differently. More specifically, they will be unequally curtailed in proportion to the relative correlations of the two scores with the selection variable. As a result, the mean and variance of one score will be more drastically altered than the mean and variance of the other. A line of relationship derived even from these data would have limited value. In contrast, scores taken from parallel tests will be equally affected by selection on an external variable, so there will be no systematic effect on the conversion equation.
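Point 3 can be illustrated with a small simulation; every quantity in it is hypothetical. The sketch assumes bivariate-normal scores correlating .60 on an SAT-like scale (mean 500, SD 100) and an ACE-like scale (mean 100, SD 20). It keeps the top 40 per cent on the SAT-like score. It then compares how much each standard deviation shrinks and how the slope of a means-and-standard-deviations conversion changes.

```python
import random
import statistics


def selection_effect(r=0.6, n=20000, keep=0.4, seed=1):
    """Simulate explicit selection on one test and compare SD shrinkage.

    All parameters are hypothetical; scores are bivariate normal with correlation r.
    """
    rng = random.Random(seed)
    sat, ace = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = r * z1 + (1 - r * r) ** 0.5 * rng.gauss(0.0, 1.0)
        sat.append(500 + 100 * z1)   # SAT-like scale (assumed)
        ace.append(100 + 20 * z2)    # ACE-like scale (assumed)
    cutoff = sorted(sat)[int((1 - keep) * n)]
    kept = [(s, a) for s, a in zip(sat, ace) if s >= cutoff]
    sat_sel = [s for s, _ in kept]
    ace_sel = [a for _, a in kept]
    ratio_sat = statistics.pstdev(sat_sel) / statistics.pstdev(sat)
    ratio_ace = statistics.pstdev(ace_sel) / statistics.pstdev(ace)
    # slope of the ACE-to-SAT conversion that equates means and SDs
    slope_all = statistics.pstdev(sat) / statistics.pstdev(ace)
    slope_sel = statistics.pstdev(sat_sel) / statistics.pstdev(ace_sel)
    return ratio_sat, ratio_ace, slope_all, slope_sel


print(selection_effect())
```

Under these assumptions the SAT-like standard deviation shrinks to a little over half its original size, while the ACE-like standard deviation keeps most of its spread. The slope of the conversion line therefore drops noticeably, which is the systematic distortion the text describes.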

It should be emphasized here that the forego- 
ing problems inherent in the development of a 
table of comparable scores do not replace or pre- 
clude the kinds of error-sources that were cited 
earlier in connection with equivalent scores that 
are taken from parallel forms. The problems 
of comparable scores exist in addition to the 
problems of equivalent scores. 

Added to the systematic effects of (1) defini-
tion and method, (2) choice of population, and
(3) selection, there are other problems, namely 
those of interpretation in considering a table of 
comparable scores. For example, there is a 
constant danger that the converted scores will be 
assumed to possess the reliability and the valid- 
ity of the scale to which they are converted. 
While it should go without saying that ACE scores, 
however well they are converted to the SAT scale, 
are still ACE scores and retain all the basic char- 
acteristics of the ACE, many test users will con- 
sider that once converted, they should behave in 
all respects like SAT scores. Obviously this is 
unjustified. The reliability and validity of ACE 
scores that are converted to the scale of the SAT 
will be no different from the reliability and valid- 
ity of the original unconverted ACE scores, and 
will not resemble SAT data in the slightest. Also, 
if regression methods are used for establishing 
the conversion line, then the predicted SAT 
scores will not have the same distribution as or- 
iginal SAT scores; the predicted SAT scores will 
have a regressed distribution. The pooling of 
these two kinds of distributions, taken from actu- 
al and predicted scores would therefore be inap- 
propriate. And yet it is for the very purpose of 
pooling such data, or considering them in con- 
junction, that such conversion tables are ordin- 
arily called for. And again, if regression meth- 
ods are used, then the prediction of grades from 
predicted SAT scores will involve double regres- 
sion while the prediction of grades from original 
SAT scores will involve only single regression. 
These two types of systems should not be merged, 
although the need that has been expressed for de- 
veloping such a conversion is, essentially, to 
merge the systems or (which is equivalent in this 
context) to compare directly scores taken from 
different tests. 
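The contrast between a regression line and a line that equates standard-score deviates can be made concrete. So can the regressed distribution of predicted scores. The sketch uses hypothetical summary statistics (means 50 and 500, standard deviations 10 and 100, r = .80), and its helper names are illustrative only.

```python
def regression_line(mx, sx, my, sy, r):
    """'Most likely' y for a given x: the least-squares regression of y on x."""
    return lambda x: my + r * (sy / sx) * (x - mx)


def mean_sigma_line(mx, sx, my, sy):
    """Equate standard-score deviates: (y - my) / sy = (x - mx) / sx."""
    return lambda x: my + (sy / sx) * (x - mx)


# Hypothetical summary statistics for two non-parallel tests (r = .80).
mx, sx, my, sy, r = 50.0, 10.0, 500.0, 100.0, 0.80
reg = regression_line(mx, sx, my, sy, r)
eq = mean_sigma_line(mx, sx, my, sy)
for x in (30, 50, 70):
    print(x, reg(x), eq(x))
```

The two lines cross at the means and separate as x moves away from them; at x = 70 they give 660 and 700. Scores predicted from the regression line have a standard deviation of r·σ_y = 80 rather than 100. This is the regressed distribution that makes pooling predicted and actual SAT scores inappropriate.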

To point up the fact that the SAT and the ACE, 
although presumably both measures of general 
scholastic aptitude, are actually not equivalent 
forms, it will be of interest to examine the dif- 
ferences in item content and the correlations be- 
tween the tests. The SAT is, for one thing, a 
three-hour power test; the ACE is a one-hour 
speeded test. The SAT contains verbal analo- 
gies, verbal opposites, verbal completions, and 
reading comprehension; the ACE contains verbal 
analogies, same-opposite items, and definitions. 
The SAT contains arithmetic reasoning items, 
and algebra and geometry items involving ele- 
mentary principles; the ACE contains arithmetic, 
number series, and spatial items (figure analo- 
gies). At least from a cursory point of view, it 
would appear that the designs of these two tests 
are not entirely alike. 

The correlations among the parts of the SAT and the ACE, and their reliabilities, are given in Table I. They are average figures, taken in part from the booklet "College Board Scores," published by the College Entrance Examination Board, and in part from validity and equating studies conducted by Educational Testing Service. They are estimated for unselected candidate groups. The reliabilities for the SAT are internal-consistency coefficients (KR-20); the reliabilities for the ACE are modified parallel-forms coefficients. The numbers in parentheses are estimated correlations between true scores. Admittedly, the figures are only rough estimates, given only for purposes of discussion. However, it is felt that they are not substantially different from what would be found in a carefully controlled study.
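The text does not say how the parenthesized true-score correlations were obtained. However, the classical correction for attenuation reproduces two of them from the observed correlations and reliabilities, so the sketch below assumes that formula. The reliability assignments (.94 for SAT-V, .90 for SAT-M, .88 for ACE-L) are our reading of a partly illegible Table I.

```python
import math


def true_score_r(r_xy, rel_x, rel_y):
    """Correction for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Observed correlations and reliabilities as read from Table I (see lead-in).
print(round(true_score_r(0.82, 0.94, 0.88), 2))   # 0.9  (SAT-V vs. ACE-L)
print(round(true_score_r(0.60, 0.94, 0.90), 2))   # 0.65 (SAT-V vs. SAT-M)
print(round(1 - 0.90 ** 2, 2))                    # 0.19 of true-score variance unexplained
```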

It is observed in Table I that none of the esti- 
mated correlations between true scores is very 
close to unity, as would be the case if the tests 
were parallel. The highest true-score correlation,
.90 for SAT-Verbal vs. ACE-Linguistic,
still leaves almost one-fifth of the true-score
variance unaccounted for (1 - .90² = .19). Consequently, we can
be sure that the considerations normally applicable
to the problem of equivalent scores are not
applicable in the present situation, since these 
tests are not parallel. 

It is suggested that considerable attention be 
given to the non-unique character of tables that 
purport to translate scores across tests of dif- 
ferent function. While there are no data avail- 
able that have been collected for the express pur- 
pose of demonstrating this non-uniqueness, some 
data are available and can be used as reference. 
For the most part they were collected for validity
purposes and do not quite fit the kind of design
most desirable for demonstrating some of
the points discussed above. However, within the
limitations of these data, it will be seen that
groups that would normally be expected to yield
different profiles on the SAT and ACE do in fact
yield different lines of comparability. It is pri- 
marily to the objective of multiple tables of com- 
parable scores that the present data are ad- 
dressed; and it is suggested here that as many 
tables need to be prepared as there are separate 
definable groups of individuals whose profiles on 
the two scores will be expected to be different, 
and who will normally be given the tests. 

The present data were collected at five colleges: three liberal arts colleges, referred to here as Colleges A, B, and C; a state university (College D); and a technological institution (College E). For the purposes of the present report the data are separated into two sets: those for the arts groups at Colleges A, B, and C, and those for four groups at the state university (College D). (The data taken from College E are based on ACE-Total score and will be considered later.) The separation into these two sets is necessary because an unknown edition of the ACE was given at the state university (D). Consequently, those data cannot be shown together with the data from the other three schools.

Two groups of male liberal arts students are 
studied at College A, those entering college in 
1949 (N = 398), and those entering college in 1950 
(N = 448). Both groups took the 1949 form of the 
ACE College Edition. The form of the SAT probably varies within the groups, but its identification is of no consequence in these data, since all forms of the SAT are related to a common scale.

One group of 141 male liberal arts students is available from College B. This group entered college in 1948 and took the 1948 form of the ACE.

One group of 154 students is available from 
College C, and comprises male students in liber- 
al arts and female students in a nursing school 
curriculum. Their year of college entrance was 
1951; they took the 1948 form of the ACE. 

Since conversion equations are available which 
permit the transformation of scores from the 1948 
to the 1949 form of the ACE, it was possible to 
combine the foregoing four arts groups in one 
presentation. 

Four groups of students are available at College D, the state university: 333 male liberal arts students, 582 female liberal arts students, 134 male students in Economics and Business, and 75 male engineering students. All students entered the university in 1947. The form of the test taken is unknown. The only information available is that the ACE raw scores were converted to percentile scores and then transformed to a normalized scale with M = 13 and σ = 4.

A separate set of data are available for 305 
freshmen entering the technological institution 
(College E) in 1950. For these students only 
SAT-M and ACE-Total (1948 Edition) scores are
available. Consequently, these data are given 
separately from the rest. 

Unfortunately, detailed knowledge is not avail- 
able regarding the selection procedures at the 
five colleges. It is known that the ACE was taken after admission at College A; the SAT was taken before admission in the regular CEEB series. However, SAT scores were reportedly not used for selection of the 1949 entering group, and were used only to a "limited extent" for the 1950 entering group. At College B the SAT was given after entrance; nothing is known regarding the date of the ACE administration. At College C both tests were given after entrance, and this appears to be true at College D as well. Finally, at College E both tests were given after entrance; selection was made on the basis of high school grades.

It appears likely that in no case were the stu- 
dents selected explicitly on either the SAT or the 
ACE. 

Seven tables are given, all derived simply by 
equating means and standard deviations. The 
first three tables describe the conversions
obtained for the arts groups—two at College A, one










TABLE I

ESTIMATED CORRELATIONS, TRUE-SCORE CORRELATIONS, AND RELIABILITIES FOR
SAT AND ACE

          SAT-M        ACE-L        ACE-Q    Reliabilities
SAT-V     .60 (.65)    .82 (.90)    ...      .94
SAT-M                  .42 (.48)    ...      .90
ACE-L                               ...      .88
ACE-Q                                        .86





TABLE II

CONVERSIONS FOR COLLEGES A, B, AND C: ACE-L vs. SAT-V

                          SAT-V Scores
College A       College A       College B    College C    Range of
(1949 Group)    (1950 Group)                              Converted Scores
N = 398         N = 448         N = 141      N = 154

236             239             247          253          17
376             376             385          395          19
516             512             522          538          26
656             649             659          681          32
796             786             796          824          38

















TABLE III

CONVERSIONS FOR COLLEGES A, B, AND C: ACE-Q vs. SAT-M

ACE-Q                        SAT-M Scores
Scores    College A       College A       College B    College C    Range of
          (1949 Group)    (1950 Group)                              Converted Scores
          N = 398         N = 448         N = 141      N = 154

...       258             232             214          208          50
...       404             399             376          377          28
...       550             566             538          545          28
...       697             733             699          713          36
...       (843)           (900)           (870)        (882)
















TABLE IV

CONVERSIONS FOR COLLEGES A, B, AND C: ACE-Q vs. SAT-V

ACE-Q                        SAT-V Scores
Scores    College A       College A       College B    College C    Range of
          (1949 Group)    (1950 Group)                              Converted Scores
          N = 398         N = 448         N = 141      N = 154

...       (185)           225             (195)        (179)        46
...       355             379             ...          346          33
...       526             532             ...          ...          ...
...       697             686             ...          ...          ...
...       (867)           (840)           ...          ...          ...











TABLE V

CONVERSIONS FOR COLLEGE D: ACE-L vs. SAT-V

ACE-L                        SAT-V Scores
Scores*   Men Arts    Women Arts    Men E&B    Men Eng.    Range of
          N = 333     N = 582       N = 134    N = 75      Converted Scores

...       275         273           285        202         83
...       416         406           412        388         28
...       557         540           540        574         34
...       698         674           668        760         92
...       (839)       (807)         795        (946)       151

*Special scale used by College D.





TABLE VI

CONVERSIONS FOR COLLEGE D: ACE-Q vs. SAT-M

ACE-Q                        SAT-M Scores
Scores*   Men Arts    Women Arts    Men E&B    Men Eng.    Range of
          N = 333     N = 582       N = 134    N = 75      Converted Scores

...       (193)       (177)         246        281         104
...       359         323           382        429         106
...       525         469           519        577         108
...       691         615           656        725         110
...       (857)       761           792        ...         112

*Special scale used by College D.










TABLE VII

CONVERSIONS FOR COLLEGE D: ACE-Q vs. SAT-V

ACE-Q                        SAT-V Scores
Scores*   Men Arts    Women Arts    Men E&B    Men Eng.    Range of
          N = 333     N = 582       N = 134                Converted Scores

...       (149)       214           243        ...         ...
...       337         392           370        ...         ...
...       525         570           498        ...         ...
...       713         748           626        ...         ...
...       (901)       (926)         ...        ...         153

*Special scale used by College D.


TABLE VIII

CONVERSIONS FOR COLLEGE E (N = 305): ACE-TOTAL vs. SAT-M

ACE-Total Scores    SAT-M Scores
 20                 ...
 60                 ...
100                 ...
140                 ...
180                 ...













at College B, and one at College C—separately 
for converting ACE-L to SAT-V, ACE-Q to SAT- 
M, and ACE-Q to SAT-V. (The last of these 
three is admittedly not one which would ordinar- 
ily be called for; however, it is considered use- 
ful in the present context in order to illustrate 
the effect of conversion with two tests that are 
known to have low correlation.) The second 
three tables parallel the first three, except that
they have been drawn up from data taken from
College D for four different groups: Men in Arts,
Women in Arts, Men in Engineering, and Men
in Economics and Business. The last table describes
the conversion from ACE-Total scores
to SAT-M and is based on data collected at College E.
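All seven tables were obtained by equating means and standard deviations. Below is a minimal sketch of that conversion. The helper name `mean_sigma_conversion` and the score lists are hypothetical, and population (divisor-n) standard deviations are assumed. The conversion uses only each group's marginal means and standard deviations. As a result, two groups with the same ACE distribution but different SAT distributions receive different lines, which is the normative effect the tables document.

```python
import statistics


def mean_sigma_conversion(x_scores, y_scores):
    """Linear conversion that matches the means and (divisor-n) SDs of x and y."""
    mx, my = statistics.mean(x_scores), statistics.mean(y_scores)
    sx, sy = statistics.pstdev(x_scores), statistics.pstdev(y_scores)
    return lambda x: my + (sy / sx) * (x - mx)


# Two hypothetical groups with the same ACE distribution but different SAT profiles.
ace = [40, 50, 60, 70, 80]
sat_group_a = [400, 450, 500, 550, 600]
sat_group_b = [420, 500, 580, 660, 740]
conv_a = mean_sigma_conversion(ace, sat_group_a)
conv_b = mean_sigma_conversion(ace, sat_group_b)
print(conv_a(70), conv_b(70))
```

Here an ACE score of 70 converts to about 550 in the first group and about 660 in the second.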

The three tables for the groups at Colleges 
A, B, and C describe a series of conversions 
which show a reasonable amount of agreement, 
but less agreement, it is judged, than would be 
found if the two tests were parallel. On the oth- 
er hand, they are not so disparate that they would
fail to be useful to colleges in providing rough
score transformations.

The foregoing three tables show that the largest range of converted scores is 50 (in the ACE-Q vs. SAT-M conversion) and the smallest is 17 (in the ACE-L vs. SAT-V conversion). In general, the ACE-L vs. SAT-V conversion shows the least disagreement among the four groups of students, possibly because the correlation between those two scores is the highest of the three. Disagreements of the magnitude observed here may not be excessive and could perhaps be tolerated. It is maintained that the small amount of disagreement among the lines is accounted for by the fact that the four groups in question are relatively homogeneous in character.
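The "range of converted scores" in each table appears to be simply the largest minus the smallest converted score in a row, taken across groups. The sketch below assumes that definition. It applies it to the first four rows of Table II and reproduces the printed ranges.

```python
# First four rows of Table II: SAT-V equivalents of the same ACE-L scores for
# College A (1949), College A (1950), College B, and College C.
table_ii = [
    [236, 239, 247, 253],
    [376, 376, 385, 395],
    [516, 512, 522, 538],
    [656, 649, 659, 681],
]


def converted_ranges(rows):
    """Range (max - min) of the converted scores across groups, row by row."""
    return [max(row) - min(row) for row in rows]


print(converted_ranges(table_ii))   # [17, 19, 26, 32]
```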

In contrast to the foregoing tables, Tables V,
VI, and VII show quite a different picture.
They give the same kinds of information
for the groups at College D, the state university.
These groups are not considered to be homogeneous.
As noted above, they include liberal
arts men, liberal arts women, men in economics
and business, and men in engineering.

It is seen that the ranges of converted scores for the groups of students at College D are far greater than for the groups at Colleges A, B, and C. Although the groups at College D are all students in one college, their conversion lines differ more than the lines for the liberal arts students who come from different colleges. With respect to the type of conversion obtained, there is greater within-college variability than between-college variability. It is the judgment of the writer that this finding is wholly


ANGOFF 







explainable by the differences in type of curriculum within the college. Liberal arts students show conversion lines that are characteristically different from those of engineering students. Likewise, male students show conversion lines that are characteristically different from those of female students. These differences occur when the tests in question are not parallel forms.

Finally, it will be of interest to examine the conversions obtained with data from College E, the technological institution. These conversions are summarized briefly in Table VIII.

It is striking to note in Table VIII that an ACE-Total score of 100 is found to be approximately comparable to 500 on SAT-M. Suppose this conversion were generalized to all possible types of populations. One would then have to conclude that the norms group for the ACE, which normally averages 100 or so on the ACE, would score at 500 on SAT-M. It would therefore be at the same general level of ability as the College Board standard group. However, in the light of the composition of the College Board college population and of the usual ACE norms group, it must be admitted that the conclusion of similar general ability in the two populations is untenable. Consequently, the above conversion cannot be considered a general law operating for all possible types of student populations. It must be considered specific to the group studied, or at least to similar groups.


Summary 


The present paper has addressed itself chiefly to the argument that conversion lines relating tests of different function are primarily normative in character and will not have generality of meaning. Different groups of individuals, with their own characteristic backgrounds, curricular emphases, and interests, will yield quite different conversion systems, and these systems will reflect the differences among the groups. Consequently, it is considered erroneous to speak of "a conversion line" for tests of different function. Other errors of bias also accrue in this problem of non-parallel tests, beyond those resulting from differences in groups. These include errors related to the effects of selection and errors related to methodological definition; neither appears in the problem of equivalent scores derived over parallel forms. Finally, the conversion of scores on non-parallel tests involves errors of interpretation that are not present in the conversion of scores on parallel forms.











THE COMPARABILITY OF THE BI-FACTOR
AND SECOND-ORDER FACTOR PATTERNS


JOHN SCHMID, Jr. * 
Personnel Research Laboratory 
Lackland Air Force Base, Texas 


IN RECENT years a variety of tools for the multivariate analysis of psychological data have been developed (4). Among these, factor analysis has been used quite extensively. Unfortunately, however, factor analysis has been characterized by a multiplicity of forms, each yielding a different type of solution. The end of the proliferating process that has given rise to this assortment of solutions is not yet in sight, for new types of solutions are being developed yearly. The net result is that the research worker who would like to summarize a large group of intercorrelations in a meaningful way is confronted with apparently divergent methods and solutions.

It is time that we stop and take stock of some of the similarities and differences among the various factor methods, with the view that some unification is possible. This paper intends to demonstrate that, under certain conditions, two of the most popular solutions are identical except for computational methodology. These are the bi-factor solution proposed by Holzinger (3) and the second-order solution proposed by Thurstone (6).

In the bi-factor solution the factors are orthogonal. They consist of one general factor, more appropriately designated a basic factor by Burt (1: 46), together with group factors. The common factors of a nine-variable bi-factor solution might appear symbolically as follows:

                 General          Group Factors
Variable         Factor        I       II      III
    1              x           x
    2              x           x
    3              x           x
    4              x                   x
    5              x                   x
    6              x                   x
    7              x                           x
    8              x                           x
    9              x                           x


Another type of solution that has received widespread use is oblique simple structure. It has become popular because the patterning of the factor loadings gives the appearance of easy psychological interpretation. However, correlations between the factors offset this apparent gain in ease of interpretation. Thurstone, recognizing this defect, offers an alternative solution in which the intercorrelations between the primary factors are factored completely. This complete solution is then used as a rotation matrix for transforming the oblique solution into an orthogonal solution in which the group factors of simple structure are maintained. However, additional factors are produced, which are derived from the original oblique factor intercorrelations.

When one second-order factor exists, this 
solution resembles the stepladder pattern of the 
Holzinger bi-factor loadings. Thomson (5: Ch. 
XIX) illustrates the development of a complete 
second-order solution very clearly. 

A survey of the three texts by Holzinger (3), Thurstone (6), and Thomson (5) has shown no explicit statement that the bi-factor solution and the second-order solution are identical, although some statements tacitly suggest similarity of the two solutions. The purpose of this study, therefore, has been to gather evidence about the identity or similarity of the two solutions.

Holzinger has shown that the various factor solutions merely represent different sets of reference axes placed in the common-factor space of the test variables. He also shows that transformation matrices exist which will carry one solution into another. He presents an illustration (3: 257-258) in which an oblique primary pattern, related very simply to a simple structure (2), is transformed into a bi-factor solution. He indicates (3: 258) that a primary solution tends to have smaller communalities than its corresponding bi-factor solution, because its common-factor space is one dimension smaller. The question arises, however, whether the number of dimensions in a common-factor solution actually affects the communalities. The following example presents some evidence that, for the model used here, the communalities are unaffected by the added dimension found in a bi-factor solution.

In order to investigate the identity of the bi-factor and second-order solutions, a model correlation matrix was constructed. It was believed that an empirical correlation matrix definitely having only one second-order factor, which would


*Air Force Personnel and Training Research Center, Air Research and Development Command 





TABLE I

INTERCORRELATIONS OF VARIABLES
(Decimal points omitted)





March, 1957) 





TABLE II


PRIMARY PATTERNS 













JOURNAL OF EXPERIMENTAL EDUCATION (Vol. 25 


TABLE III 


TRANSFORMATION MATRIX 











TABLE IV 


SECOND-ORDER FACTOR SOLUTION 






























be useful for this study, might be difficult to find in the literature. In addition, sampling errors of various types might conceal a real identity under the guise of mere similarity. Therefore, a model correlation matrix (Table I) was constructed from two components. The first is the primary factor solution, $P$. The second is a matrix of intercorrelations between the primary factors, $\Phi$, having only one common factor (Table II). The operation $P\Phi P'$ was performed to produce the correlation model.

The intercorrelations between the factors, $\Phi$, were factored by the usual centroid method. One common and three unique factors were produced. These are reported as matrix $E'$ (Table III). This matrix, $E'$, is used as a transformation matrix for rotating the primary pattern, $P$, into a second-order solution, which appears as matrix $B$ (Table IV). This second-order solution was computed in the usual way for constructing a second-order solution (6: Ch. XVIII; 5: Ch. XIX).

At this point, we have a second-order factor solution which reproduces the correlations and communalities of our correlation matrix model exactly. The question to be answered in this study was this: if our correlation matrix model, $P\Phi P'$, is factored by the Holzinger bi-factor method, will the resulting solution be identical to the second-order factor solution just constructed?
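The factoring of $\Phi$ described above can be sketched in Python. This is a minimal illustration, and several assumptions should be noted. The second-order loadings (0.9, 0.8, 0.7) are invented. The exact communalities are placed on the diagonal before factoring. `one_factor_centroid` is a hypothetical helper that computes only the first centroid factor: column sums divided by the square root of the grand sum, with no sign reflection. For an exactly one-factor matrix this recovers the loadings. Appending the unique-factor columns then gives a 3 × 4 matrix $E'$ whose product with its own transpose is $\Phi$.

```python
import numpy as np


def one_factor_centroid(reduced: np.ndarray) -> np.ndarray:
    """First centroid factor of a correlation matrix with communalities on the diagonal.

    Loading j = (column sum j) / sqrt(sum of all entries). This sketch assumes all
    entries are positive, so no sign reflection is needed.
    """
    col_sums = reduced.sum(axis=0)
    return col_sums / np.sqrt(reduced.sum())


# Hypothetical second-order loadings of three primary factors on one general factor.
lam_true = np.array([0.9, 0.8, 0.7])
Phi = np.outer(lam_true, lam_true)
np.fill_diagonal(Phi, 1.0)  # the primary factors are standardized

# Replace the unit diagonal by the (here exactly known) communalities lam^2.
reduced = Phi - np.diag(1.0 - lam_true**2)
lam = one_factor_centroid(reduced)

# E' = [common factor | three unique factors]; E' times its transpose gives back Phi.
E_prime = np.column_stack([lam, np.diag(np.sqrt(1.0 - lam**2))])
print(np.round(E_prime, 4))
```

With real data, the communalities would have to be estimated, and the centroid step would generally need sign reflection and iteration. The exact recovery shown here depends on the model being exactly one-factor.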

A Holzinger bi-factor solution was calculated independently on the correlation matrix model, following the methodology prescribed by Holzinger (3). The factor loadings were found to be identical to those in Table IV. Thus it was established, within the framework and limitations of the type of model set up in this problem, that the two types of solutions differ only in methodological procedure. This seems to reconcile the apparent differences between the Holzinger and Thurstone schools of thought on factor analysis. It should be noted that the communalities for the primary


SCHMID











solution and the bi-factor solution are identical. Apparently, then, communality is not a function of the number of dimensions used in obtaining either a simple structure solution or a bi-factor solution.
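The identity found here can also be seen algebraically. Suppose $E'$ factors $\Phi$ completely, so that $E'E = \Phi$. Then $B = PE'$ gives $BB' = PE'EP' = P\Phi P'$. The four-factor solution therefore reproduces the model exactly and has the same communalities as the three-factor primary solution.

The sketch below checks this numerically, using an invented nine-variable primary pattern in which each variable loads on one primary. It also confirms the stepladder pattern, with one group-factor loading per variable. It follows the second-order (Thurstone) route only; it is not an implementation of Holzinger's bi-factor procedure.

```python
import numpy as np

# Hypothetical primary pattern P: nine variables, three primary factors,
# each variable loading on exactly one primary (simple structure).
P = np.zeros((9, 3))
P[0:3, 0] = [0.8, 0.7, 0.6]
P[3:6, 1] = [0.7, 0.6, 0.5]
P[6:9, 2] = [0.6, 0.5, 0.4]

# Hypothetical one-factor structure of the primary intercorrelations Phi.
lam = np.array([0.9, 0.8, 0.7])
Phi = np.outer(lam, lam)
np.fill_diagonal(Phi, 1.0)

# Common-factor part of the model correlation matrix, P Phi P'.
R_model = P @ Phi @ P.T

# Complete factoring of Phi: one common factor plus three unique factors.
E_prime = np.column_stack([lam, np.diag(np.sqrt(1.0 - lam**2))])

# Second-order solution: column 0 is the general factor, columns 1-3 the group factors.
B = P @ E_prime

reproduces = np.allclose(B @ B.T, R_model)
h2_primary = np.diag(R_model)      # communalities of the oblique primary solution
h2_bifactor = (B**2).sum(axis=1)   # communalities of the four-factor solution
group_loadings_per_var = (np.abs(B[:, 1:]) > 1e-12).sum(axis=1)

print("reproduces P Phi P':", reproduces)
print("communalities equal:", np.allclose(h2_primary, h2_bifactor))
print("group factors per variable:", group_loadings_per_var)
```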

A second-order factor solution becomes considerably more complicated when more than one second-order factor is present. Speculation suggests that the factoring of higher-order levels should be done successively. Each resulting transformation, $E_j$, would then be used to build a transformation matrix that rotates each higher-order oblique solution into bi-factor form.


REFERENCES

1. Burt, Cyril. "Group Factor Analysis," British Journal of Psychology, Statistical Section, …

2. Harris, C. W., and Knoell, D. M. "The Oblique Solution in Factor Analysis," Journal of Educational Psychology (1948), pp. 385-403.

3. Holzinger, Karl J. Factor Analysis (Chicago: University of Chicago Press, 1951).

4. Rao, C. Radhakrishna. Advanced Statistical Methods in Biometric Research (New York: John Wiley and Sons, Inc., 1952).

5. Thomson, Godfrey H. The Factorial Analysis of Human Ability (New York: Houghton Mifflin Co., 1948).

6. Thurstone, L. L. Multiple-Factor Analysis (Chicago: University of Chicago Press, 1947).













AN ANALYSIS OF VARIANCE OF MULTIPLE
MEASUREMENTS ON SUBJECTS CLASSIFIED
IN UNEQUAL GROUPS OF
ONE DIMENSION


RAYMOND O. COLLIER, Jr., and CLAYTON L. STUNKARD 
University of Minnesota 


Introduction 


IN EXPERIMENTS where more than one measurement is taken on a subject, special analysis is required because the measurements lack statistical independence. Typical of such experiments are those in which subjects are first categorized into one of several different groups. Multiple measurements are then made on each subject under conditions corresponding to all combinations of one or more ways of classification.

Various writers have examined aspects of the multiple-measurements problem. The foundations were laid by Yates (8, 9, 10) in his early papers on the split-plot design, which has been used extensively in agricultural experimentation. More recent work on this general problem includes that by Moonan (5, 6), Hoyt, Stunkard, et al. (3), Kogan (4), Halperin (2), and Collier (1). This paper considers the model and analysis of a design in which subjects are classified into several groups along one classification dimension, possibly with unequal frequencies. Each subject is then measured under all combinations of a further two-way classification system.


The Model 


Let us assume that $N$ subjects are classified into $R$ groups such that the $r$th group contains $n_r$ subjects, with $r = 1, 2, \ldots, R$ and $\sum_r n_r = N$.


For identification purposes this dimension, experimental or not, is designated classification dimension "r". Now let the subjects within a group be designated by the subscript $s$, with $s = 1, 2, \ldots, n_r$. Each subject is observed under all combinations of two classifications:

- the $P$ categories of one classification dimension (dimension "p"), represented by the subscript $p$ ($= 1, 2, \ldots, P$);
- the $Q$ categories of a second way of classification (dimension "q"), designated by $q$ ($= 1, 2, \ldots, Q$).

The random variable thus obtained is denoted by $X_{pqrs}$. It refers to the measurement made on the $s$th subject of the $r$th group of dimension "r", in the $p$th category of classification dimension "p" and the $q$th category of classification dimension "q". We now assume that the $X_{pqrs}$ are $PQN$-variate normally distributed, with expected value a linear function of fixed constants,


$$E(X_{pqrs}) = \mu + \alpha(1,p) + \alpha(2,q) + \alpha(3,r) + \beta(1,pq) + \beta(2,pr) + \beta(3,qr) + \gamma(pqr) \qquad (1)$$


where $\mu$ is the general effect;

$\alpha(1,p)$, $\alpha(2,q)$, and $\alpha(3,r)$ are the main effects for the $p$th, $q$th, and $r$th categories of dimensions "p", "q", and "r", respectively;

$\beta(1,pq)$, $\beta(2,pr)$, and $\beta(3,qr)$ are the interaction effects for the $p$th and $q$th, $p$th and $r$th, and $q$th and $r$th categories of dimensions "p" and "q", "p" and "r", and "q" and "r", respectively;

$\gamma(pqr)$ are the interaction effects for the $p$th, $q$th, and $r$th categories of dimensions "p", "q", and "r";

and again, as before, $p = 1, 2, \ldots, P$; $q = 1, 2, \ldots, Q$; $r = 1, 2, \ldots, R$; and $s = 1, 2, \ldots, n_r$.


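The variance of $X_{pqrs}$ is further assumed to be

$$V[X_{pqrs}] = \sigma^2, \qquad (2)$$

and any two measurements are assumed to be correlated, with covariance

$$\mathrm{Cov}[X_{pqrs}, X_{p'q'r's'}] = \rho\sigma^2, \qquad (3)$$

where $\rho = 0$ whenever the two measurements are on different subjects, that is, whenever $r \neq r'$ or $s \neq s'$.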


The logic of this last assumption is that two or more measurements on the same subject are stochastically dependent, i.e., they have a covariance different from zero, whereas two measurements from different subjects are stochastically



















independent or have zero covariance. 


Outline of Analysis 





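The analysis of the foregoing model requires that certain restrictions be placed on the parameters. Let these be

$$\begin{aligned}
&\sum_p \alpha(1,p) = \sum_q \alpha(2,q) = \sum_r n_r\,\alpha(3,r) = \sum_p \beta(1,pq) \\
&= \sum_q \beta(1,pq) = \sum_p \beta(2,pr) = \sum_r n_r\,\beta(2,pr) \\
&= \sum_q \beta(3,qr) = \sum_r n_r\,\beta(3,qr) = \sum_p \gamma(pqr) \\
&= \sum_q \gamma(pqr) = \sum_r n_r\,\gamma(pqr) = 0. \qquad (4)
\end{aligned}$$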


In order to render the dependent multiple measurements statistically independent of one another, a modification of a transformation due to Nandi (7) is applied to the $X_{pqrs}$. Letting $a$ and $b$ be particular values of $p$ and $q$, respectively, we employ the following transformation:

$$Y_{abrs} = \sqrt{\frac{(a-1)Q+b}{(a-1)Q+b+1}}\left[X_{a(b+1)rs} - \frac{1}{(a-1)Q+b}\left(\sum_{p=1}^{a-1}\sum_{q=1}^{Q} X_{pqrs} + \sum_{q=1}^{b} X_{aqrs}\right)\right],$$

where $a = 1, 2, \ldots, P$ and $b = 1, 2, \ldots, Q-1$;

$$Y_{aQrs} = \sqrt{\frac{aQ}{aQ+1}}\left[X_{(a+1)1rs} - \frac{1}{aQ}\sum_{p=1}^{a}\sum_{q=1}^{Q} X_{pqrs}\right],$$

where $a = 1, 2, \ldots, P-1$ and $b = Q$; and

$$Y_{PQrs} = \frac{1}{\sqrt{PQ}}\sum_{p=1}^{P}\sum_{q=1}^{Q} X_{pqrs}. \qquad (5)$$


As a result of this transformation it may be shown that the covariance of any two distinct transformed variates is zero:

$$\mathrm{Cov}[Y_{pqrs}, Y_{p'q'r's'}] = 0, \qquad (p, q, r, s) \neq (p', q', r', s'). \qquad (6)$$


In addition, the $(PQ-1)N$ transformed variates $Y_{abrs}$, i.e., those for which $p$ and $q$ are not equal




to P and Q simultaneously, have the variance 





$$\sigma_1^2 = V[Y_{abrs}] = (1-\rho)\,\sigma^2, \qquad (7)$$








and the $N$ remaining variates, $Y_{PQrs}$, for which $p$ and $q$ equal $P$ and $Q$ simultaneously, and which may be labeled $Z_{PQrs}$, have variance

$$\sigma_2^2 = V[Z_{PQrs}] = [1 + (PQ-1)\rho]\,\sigma^2. \qquad (8)$$
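The orthogonality behind (6) through (8) can be checked numerically, and the following minimal sketch does so. It builds the $PQ \times PQ$ matrix implied by the transformation (5). In that matrix, each variate contrasts the next measurement with the mean of the measurements before it, in the order $(1,1), (1,2), \ldots, (1,Q), (2,1), \ldots$. The last variate is the scaled subject total. The sketch then verifies that, under the covariance model of (2) and (3), the transformed variates are uncorrelated, with variances $(1-\rho)\sigma^2$ and $[1+(PQ-1)\rho]\sigma^2$. The values of $\sigma^2$ and $\rho$ are arbitrary, and `nandi_matrix` is a hypothetical helper name.

```python
import numpy as np


def nandi_matrix(P: int, Q: int) -> np.ndarray:
    """Orthogonal matrix whose rows define the PQ transformed variates of one subject.

    Measurements are ordered with q running fastest: (1,1), (1,2), ..., (1,Q), (2,1), ...
    Row k-1 (k = 1, ..., PQ-1) contrasts the (k+1)-th measurement with the mean of the
    first k, scaled by sqrt(k / (k + 1)); the last row is the subject total / sqrt(PQ).
    """
    n = P * Q
    H = np.zeros((n, n))
    for k in range(1, n):
        scale = np.sqrt(k / (k + 1))
        H[k - 1, :k] = -scale / k  # minus the mean of the first k measurements
        H[k - 1, k] = scale        # the (k+1)-th measurement
    H[n - 1, :] = 1.0 / np.sqrt(n)  # Z: the scaled subject total
    return H


P, Q = 2, 3               # two lists and three trials, as in the example below
sigma2, rho = 1.0, 0.4    # arbitrary values of the model parameters
n = P * Q

# Covariance of one subject's PQ measurements under (2) and (3).
Sigma = sigma2 * ((1.0 - rho) * np.eye(n) + rho * np.ones((n, n)))

H = nandi_matrix(P, Q)
cov_Y = H @ Sigma @ H.T
expected = np.diag([(1.0 - rho) * sigma2] * (n - 1) + [(1.0 + (n - 1) * rho) * sigma2])

print("orthogonal:", np.allclose(H @ H.T, np.eye(n)))
print("covariance as in (6)-(8):", np.allclose(cov_Y, expected))
```

The signs of the contrasts do not matter for (6) through (8). Any sign convention gives the same variances and zero covariances.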


The $Y_{abrs}$ and the $Z_{PQrs}$, then, are two sets of variates, normally and independently distributed within and between sets, but with different variances. Accordingly, we write two expressions for the sums of squares related to these two sets:


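(a) The sum of squares associated with $Y_{abrs}$:

$$\sum_p\sum_q\sum_r\sum_s [Y_{pqrs} - E(Y_{pqrs})]^2 - \sum_r\sum_s [Y_{PQrs} - E(Y_{PQrs})]^2 = \sum_p\sum_q\sum_r\sum_s \bigl\{X_{pqrs} - \bar X_{\cdot\cdot rs} - [\alpha(1,p) + \alpha(2,q) + \beta(1,pq) + \beta(2,pr) + \beta(3,qr) + \gamma(pqr)]\bigr\}^2 \qquad (9)$$

(b) The sum of squares associated with $Z_{PQrs}$:

$$\sum_r\sum_s [Z_{PQrs} - E(Z_{PQrs})]^2 = PQ\sum_r\sum_s \bigl\{\bar X_{\cdot\cdot rs} - [\mu + \alpha(3,r)]\bigr\}^2, \qquad (10)$$

where $\bar X_{\cdot\cdot rs} = \sum_p\sum_q X_{pqrs}/PQ$ is the mean of the $s$th subject in group $r$.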


It should be pointed out that the sum of squares on the right-hand side of (9) includes all the parameters of (1) except $\mu$ and $\alpha(3,r)$, while the right-hand side of (10) includes only these two parameters.
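Equations (9) and (10) rest on the fact that the transformation is orthogonal for each subject. As a result, the $PQ - 1$ contrast variates together carry the within-subject sum of squares $\sum_p\sum_q (X_{pqrs} - \bar X_{\cdot\cdot rs})^2$, while $Z_{PQrs}^2 = PQ\,\bar X_{\cdot\cdot rs}^2$. The minimal numerical check below uses simulated data. It deliberately uses an arbitrary orthonormal contrast matrix, built with a QR factorization, rather than Nandi's particular contrasts. This illustrates that the split does not depend on which orthonormal contrasts are chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q = 2, 3
n = P * Q

# Build an arbitrary orthonormal basis whose first vector is the constant vector:
# QR-factorize a random matrix whose first column is all ones.
M = rng.standard_normal((n, n))
M[:, 0] = 1.0
basis, _ = np.linalg.qr(M)
const = np.abs(basis[:, 0])  # +/- 1/sqrt(n) in every entry; take the + sign
H = np.vstack([basis[:, 1:].T, const])  # PQ-1 contrast rows, then the total row

x = rng.normal(5.0, 2.0, size=n)  # one subject's PQ measurements (simulated)
y = H @ x

within = ((x - x.mean()) ** 2).sum()
print("contrasts carry within-subject SS:", np.isclose((y[:-1] ** 2).sum(), within))
print("Z^2 equals PQ * mean^2:", np.isclose(y[-1] ** 2, n * x.mean() ** 2))
```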

Using normal regression theory or, equivalently, the method of least squares, quantities (9) and (10) provide estimates of the parameters involved in each set. By well-known methods these estimates may be used to generate tests of hypotheses on the parameters. The quantities necessary to perform these tests, together with the related hypotheses, are presented in the usual analysis-of-variance form in Table I.

First, we present tests of hypotheses on the parameters $\gamma(pqr)$, $\beta(1,pq)$, $\beta(2,pr)$, $\beta(3,qr)$, $\alpha(1,p)$, and $\alpha(2,q)$ involved in (9). The hypothesis concerning the interaction "p" × "q" × "r", H(1): $\gamma(pqr) = 0$, is tested by referring

$$F_{(1)} = \frac{\mathrm{M.S.}(p \times q \times r)}{\mathrm{M.S.}[\text{Error }(1)]} = \frac{\mathrm{S.S.}(p \times q \times r)/(P-1)(Q-1)(R-1)}{\mathrm{S.S.}[\text{Error }(1)]/(N-R)(PQ-1)}$$

to $F[(P-1)(Q-1)(R-1), (N-R)(PQ-1), \alpha]$, i.e., the upper







COLLIER - STUNKARD 


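TABLE I

ANALYSIS OF VARIANCE FOR DOUBLE CLASSIFICATION OF MULTIPLE MEASUREMENTS ON
SUBJECTS IN UNEQUAL GROUPS OF A SINGLE CLASSIFICATION

Source of            Hypothesis          Degrees of          Sum of
Variation            Tested              Freedom°            Squares*

Error (1)                                (N-R)(PQ-1)         a - (b+f) + i
"p" x "q" x "r"      γ(pqr) = 0          (P-1)(Q-1)(R-1)     b - (c+d+e) + (g+h+i) - j
"p" x "q"            β(1,pq) = 0         (P-1)(Q-1)          c - (g+h) + j
"p" x "r"            β(2,pr) = 0         (P-1)(R-1)          d - (g+i) + j
"q" x "r"            β(3,qr) = 0         (Q-1)(R-1)          e - (h+i) + j
"p"                  α(1,p) = 0          P-1                 g - j
"q"                  α(2,q) = 0          Q-1                 h - j
Within Subjects                          (PQ-1)N             a - f
Error (2)                                N-R                 f - i
"r"                  α(3,r) = 0          R-1                 i - j
Between Subjects                         N-1                 f - j
Total                                    PQN-1               a - j

°No Mean Square column is presented, since the hypothesis tests are given specifically in the text. However, the expected values of the mean squares for Error (1) and Error (2) are found to be $\sigma_1^2$ and $\sigma_2^2$, respectively, as expressed in (7) and (8).

*The lettered quantities are defined as follows:

$a = \sum_p\sum_q\sum_r\sum_s X_{pqrs}^2$
$b = \sum_p\sum_q\sum_r \bigl(\sum_s X_{pqrs}\bigr)^2 / n_r$
$c = \sum_p\sum_q \bigl(\sum_r\sum_s X_{pqrs}\bigr)^2 / N$
$d = \sum_p\sum_r \bigl(\sum_q\sum_s X_{pqrs}\bigr)^2 / (Q n_r)$
$e = \sum_q\sum_r \bigl(\sum_p\sum_s X_{pqrs}\bigr)^2 / (P n_r)$
$f = \sum_r\sum_s \bigl(\sum_p\sum_q X_{pqrs}\bigr)^2 / PQ$
$g = \sum_p \bigl(\sum_q\sum_r\sum_s X_{pqrs}\bigr)^2 / QN$
$h = \sum_q \bigl(\sum_p\sum_r\sum_s X_{pqrs}\bigr)^2 / PN$
$i = \sum_r \bigl(\sum_p\sum_q\sum_s X_{pqrs}\bigr)^2 / (PQ n_r)$
$j = \bigl(\sum_p\sum_q\sum_r\sum_s X_{pqrs}\bigr)^2 / PQN$

with $p = 1, 2, \ldots, P$; $q = 1, 2, \ldots, Q$; $r = 1, 2, \ldots, R$; $s = 1, 2, \ldots, n_r$; and $N = \sum_r n_r$.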
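The lettered quantities can be computed directly from the raw scores. Each one is a sum of squared totals, where each squared total is divided by the number of observations it contains. The totals used are:

- a: the single observations;
- b: the cell totals for each (p, q, r);
- c: the (p, q) totals;
- d: the (p, r) totals;
- e: the (q, r) totals;
- f: the subject totals;
- g, h, and i: the totals for each p, each q, and each r;
- j: the grand total.

The Python sketch below assumes a hypothetical layout in which each group's data is a NumPy array of shape (n_r, P, Q). `lettered_quantities` is an illustrative name, not something from the original, and the data are simulated.

```python
import numpy as np


def lettered_quantities(groups: list[np.ndarray]) -> dict[str, float]:
    """Quantities a-j used in the sums of squares of Table I.

    groups[r] holds group r's scores with shape (n_r, P, Q): subject x p x q.
    """
    P, Q = groups[0].shape[1:]
    n = np.array([g.shape[0] for g in groups], dtype=float)  # n_r for each group
    N = n.sum()
    T_pqr = np.stack([g.sum(axis=0) for g in groups], axis=-1)     # (P, Q, R) cell totals
    T_subj = np.concatenate([g.sum(axis=(1, 2)) for g in groups])  # subject totals
    T_pq = T_pqr.sum(axis=2)
    T_pr = T_pqr.sum(axis=1)
    T_qr = T_pqr.sum(axis=0)
    T_p = T_pqr.sum(axis=(1, 2))
    T_q = T_pqr.sum(axis=(0, 2))
    T_r = T_pqr.sum(axis=(0, 1))
    return {
        "a": float(sum((g ** 2).sum() for g in groups)),
        "b": float((T_pqr ** 2 / n).sum()),          # n broadcasts over the r axis
        "c": float((T_pq ** 2).sum() / N),
        "d": float((T_pr ** 2 / (Q * n)).sum()),
        "e": float((T_qr ** 2 / (P * n)).sum()),
        "f": float((T_subj ** 2).sum() / (P * Q)),
        "g": float((T_p ** 2).sum() / (Q * N)),
        "h": float((T_q ** 2).sum() / (P * N)),
        "i": float((T_r ** 2 / (P * Q * n)).sum()),
        "j": float(T_pqr.sum() ** 2 / (P * Q * N)),
    }


# Small simulated data set: two groups of 4 and 3 subjects, P = 2, Q = 3.
rng = np.random.default_rng(1)
data = [rng.integers(0, 11, size=(4, 2, 3)).astype(float),
        rng.integers(0, 11, size=(3, 2, 3)).astype(float)]
L = lettered_quantities(data)
print({k: round(v, 4) for k, v in L.items()})
```

Differences of these quantities give the sums of squares. For example, a - j is the total sum of squares about the grand mean, and f - j is the between-subjects sum of squares.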








TABLE II

RAW DATA FROM A RELEARNING EXPERIMENT

[Number of "correct" associations for each subject, by List (p = 1, 2) and Trial (q = 1, 2, 3), for the Experimental group (r = 1, subjects s = 1 to 34) and the Control group (r = 2, subjects s = 1 to 18)]














TABLE III

MEANS BY LIST, TRIAL, AND GROUP FROM A RELEARNING EXPERIMENT

                                              Trial (q)
Group                 n     List (p)      1       2       3      All
Experimental (r=1)   34        1         2.21    5.24    7.65    5.03
                               2         2.76    5.94    7.82    5.51
                             Both        2.49    5.59    7.74    5.27
Control (r=2)        18        1         3.17    5.50    8.00    5.56
                               2         3.61    6.28    8.06    5.98
                             Both        3.39    5.89    8.03    5.77





TABLE IV

COMPLETE ANALYSIS OF VARIANCE OF DATA FROM A RELEARNING EXPERIMENT

                               Degrees of     Sum of        Mean                  Test of
Source of Variation            Freedom        Squares       Square        F       Hypothesis*

Error (1)                        250          411.1645       1.6447
List x Trial x Group               2            0.1409       0.0704      0.043    Accepted
  ("p" x "q" x "r")
Residual (1)                     252          411.3054       1.6322
List x Trial ("p" x "q")           2            4.7501       2.3750      1.455    Accepted
List x Group ("p" x "r")           1            0.0525       0.0525      0.032    Accepted
Trial x Group ("q" x "r")          2            5.7832       2.8916      1.772    Accepted
List ("p")                         1           16.6153      16.6153     10.180    Rejected
Trial ("q")                        2        1,329.8269     664.9134    407.373    Rejected
Within Subjects                  260        1,768.3334
Error (2)                         50          837.0512      16.7410
Group ("r")                        1           17.5769      17.5769      1.050    Accepted
Between Subjects                  51          854.6281
Total                            311        2,622.9615

*A significance level of .01 was employed.
















$\alpha$-point of the F distribution with $(P-1)(Q-1)(R-1)$ and $(N-R)(PQ-1)$ degrees of freedom. This is because, under H(1), F(1) is distributed as $F[(P-1)(Q-1)(R-1), (N-R)(PQ-1)]$.

If the $\gamma(pqr)$ are zero, the hypothesis on the interaction "p" × "q", H(2): $\beta(1,pq) = 0$, is tested by referring

$$F_{(2)} = \frac{\mathrm{M.S.}(p \times q)}{\mathrm{M.S.}[\text{Residual }(1)]} = \frac{\mathrm{S.S.}(p \times q)/(P-1)(Q-1)}{\{\mathrm{S.S.}[\text{Error }(1)] + \mathrm{S.S.}(p \times q \times r)\}/[(P-1)(Q-1)(R-1) + (N-R)(PQ-1)]}$$

to $F\{(P-1)(Q-1), \text{d.f.}[\text{Residual }(1)], \alpha\}$, where d.f.[Residual (1)] $= (P-1)(Q-1)(R-1) + (N-R)(PQ-1)$. Two other interaction hypotheses are tested in the same form as H(2), with only the numerator differing:

- the interaction "p" × "r", H(3): $\beta(2,pr) = 0$;
- the interaction "q" × "r", H(4): $\beta(3,qr) = 0$.

Concerning the main effect "p", the hypothesis H(5): $\alpha(1,p) = 0$ is tested by referring F(5) = M.S.("p")/M.S.[Residual (1)] to $F\{(P-1), \text{d.f.}[\text{Residual }(1)], \alpha\}$. Finally, the remaining hypothesis, on the main effect "q", H(6): $\alpha(2,q) = 0$, is tested in the same fashion as H(5).

Turning to the parameters $\mu$ and $\alpha(3,r)$ involved in (10), we consider first the hypothesis concerning the main effect "r", H(7): $\alpha(3,r) = 0$. This test consists of referring F(7) = M.S.("r")/M.S.[Error (2)] to $F[(R-1), (N-R), \alpha]$. The hypothesis on the general effect, H(8): $\mu = 0$, is not listed in Table I. It may nevertheless be tested by referring F(8) = j/M.S.[Error (2)] to $F[1, (N-R), \alpha]$, where j is the lettered quantity defined in Table I.

Special attention is called to the error terms used. Hypotheses H(1) to H(6), representing the four interaction effects and the two main effects, are to be tested with a different estimate of error from the one used for the main effect of classification dimension "r", the dimension used to group the subjects.


An Application of the Analysis


As an example of a specific problem in which this type of analysis is appropriate, we refer to a relearning experiment by Carlson*. He randomly assigned 52 female residents of a college dormitory to two groups, experimental ($n_1 = 34$) and control ($n_2 = 18$). Using two lists of nonsense words, Carlson provided the experimental group with special training during sleep, while the control subjects were not so treated. This special training consisted of reading the ten words in List 1 with "correct" associates and the ten words in List 2 with certain "incorrect" associates. After this training period, the twenty words with the "correct" associates were read to both groups, followed by immediate testing in multiple-choice form. This so-called relearning procedure was continued for three trials. The measurements recorded for each subject consisted of the number of "correct" associations made during the testing on each list and trial.

Before applying the analysis presented in Table I, it is first necessary to recognize the assumptions being made. The variance of any observation is assumed to be $\sigma^2$. Any two observations taken on the same subject are assumed to be dependent, with covariance equal to $\rho\sigma^2$. Any two observations on different subjects are assumed to be independent and hence to have zero covariance.

*Raymond Carlson, Dean of Students, Bemidji State Teachers College, Bemidji, Minnesota.

The measurement taken on the $s$th member of the $r$th group of subjects, on the $q$th trial of the $p$th list of words, may be designated $X_{pqrs}$. Here $p = 1, 2$; $q = 1, 2, 3$; $r = 1, 2$; and $s = 1, 2, \ldots, n_r$, with $n_1 = 34$ and $n_2 = 18$. The raw data for this experiment are presented in Table II, and the means of the measurements are given in Table III. The basic values needed for the sums of squares in the analysis of variance outlined in Table I were computed; they are listed below as the lettered quantities a through j. These values were substituted in the expressions of Table I, together with the proper values (N = 52, P = 2, Q = 3, and R = 2) for computing the degrees of freedom. Testing the null hypotheses as outlined then gives the complete analysis of variance in Table IV.
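As a check on the arithmetic, the sketch below recomputes the sums of squares, degrees of freedom, and F ratios of Table IV from the lettered values a through j reported for this example, using N = 52, P = 2, Q = 3, and R = 2. The combinations of a through j used here (for example, Error (1) = a - b - f + i) are our reading of the Table I sum-of-squares column; they reproduce the printed sums of squares. The error terms follow the text: Error (1) for H(1), Residual (1) for H(2) through H(6), and Error (2) for H(7). The names `vals`, `ss`, `denominator`, and `F` are illustrative only.

```python
# Lettered quantities reported for the relearning example.
vals = dict(a=11864.0000, b=10615.7843, c=10592.2308, d=9275.2832, e=10594.2255,
            f=10095.6666, g=9257.6538, h=10570.8654, i=9258.6154, j=9241.0385)
N, P, Q, R = 52, 2, 3, 2
v = vals

# (sum of squares, degrees of freedom) for each source, following Table I.
ss = {
    "Error (1)": (v["a"] - v["b"] - v["f"] + v["i"], (N - R) * (P * Q - 1)),
    "p x q x r": (v["b"] - v["c"] - v["d"] - v["e"] + v["g"] + v["h"] + v["i"] - v["j"],
                  (P - 1) * (Q - 1) * (R - 1)),
    "p x q": (v["c"] - v["g"] - v["h"] + v["j"], (P - 1) * (Q - 1)),
    "p x r": (v["d"] - v["g"] - v["i"] + v["j"], (P - 1) * (R - 1)),
    "q x r": (v["e"] - v["h"] - v["i"] + v["j"], (Q - 1) * (R - 1)),
    "p": (v["g"] - v["j"], P - 1),
    "q": (v["h"] - v["j"], Q - 1),
    "Error (2)": (v["f"] - v["i"], N - R),
    "r": (v["i"] - v["j"], R - 1),
}
# Residual (1) pools Error (1) with the three-way interaction.
ss["Residual (1)"] = (ss["Error (1)"][0] + ss["p x q x r"][0],
                      ss["Error (1)"][1] + ss["p x q x r"][1])


def mean_square(source: str) -> float:
    s, df = ss[source]
    return s / df


# Error term used for each hypothesis, as described in the text.
denominator = {"p x q x r": "Error (1)", "p x q": "Residual (1)", "p x r": "Residual (1)",
               "q x r": "Residual (1)", "p": "Residual (1)", "q": "Residual (1)",
               "r": "Error (2)"}
F = {src: mean_square(src) / mean_square(err) for src, err in denominator.items()}

for src, (s, df) in ss.items():
    f_text = f"{F[src]:9.3f}" if src in F else ""
    print(f"{src:13s} {df:4d} {s:12.4f} {f_text}")
```

Because the sketch divides by unrounded mean squares, its F ratios can differ from the printed ones in the last decimal place.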

Examination of the table of means (Table III), together with the results of the tests of the various hypotheses (Table IV), leads to two conclusions: List 1 is more difficult to learn than List 2, and significant gains accrue from trial to trial. However, the experimental group does not perform significantly better or worse than the control group. Moreover, none of the three ways of classifying the measurements, List ("p"), Trial ("q"), and Group ("r"), interacts with the others; the three classifications act independently of one another.


Values of the lettered quantities (Table I) for the relearning experiment:

a = 11,864.0000    b = 10,615.7843    c = 10,592.2308    d = 9,275.2832
e = 10,594.2255    f = 10,095.6666    g = 9,257.6538     h = 10,570.8654
i = 9,258.6154     j = 9,241.0385


Summary


A model is set forth for a specific class of experimental designs. In these designs, subjects are classified into several groups along one dimension, possibly with unequal frequencies, and are measured under all combinations of categories of a further two-way classification system. Because the multiple measurements on each subject are logically assumed to be dependent, they are transformed into independent variates. The paper presents the transformation, an outline of the analysis, and the appropriate analysis of variance, with the equational quantities needed for testing hypotheses in the general case. To illustrate the procedure, data and an analysis for a special case of the design are also presented.


REFERENCES

1. Collier, R. O., Jr. Experimental Designs in Which the Observations Are Assumed to Be Correlated, unpublished Ph.D. Dissertation, University of Minnesota, 1956.

2. Halperin, Max. "Normal Regression Theory in the Presence of Intra-Class Correlation," Annals of Mathematical Statistics, XXII (1951), pp. 573-80.

3. Hoyt, C. J., Stunkard, C. L., et al. "A Comparison of Two Methods of Instruction in Beginning Drawing," Journal of Experimental Education, XX (1952), pp. 385-79.

4. Kogan, L. S. "Analysis of Variance: Repeated Measurements," Psychological Bulletin, XLV (1948), pp. 131-43.

5. Moonan, W. J. "Simultaneous Examination and Method Analysis by Variance Algebra," Journal of Experimental Education, XXIII (1955), pp. 253-57.

6. Moonan, W. J. "An Analysis of Variance Method for Determining the External and Internal Consistency of an Examination," Journal of Experimental Education, XXIV (1956), pp. 239-44.

7. Nandi, H. K. "A Mathematical Set-up Leading to Analysis of a Class of Designs," Sankhya, VIII (1947), pp. 172-76.

8. Yates, F. "The Principles of Orthogonality and Confounding in Replicated Experiments," Journal of Agricultural Science, XXIII (1933), pp. 109-45.

9. Yates, F. "Complex Experiments," Supplement to the Journal of the Royal Statistical Society, II (1935), pp. 181-223.

10. Yates, F. The Design and Analysis of Factorial Experiments, Technical Communication No. 35 (Harpenden, England: Imperial Bureau of Soil Science, 1937).











Three other articles that were scheduled for the March issue of the Journal of Experimental Education will be published in the March issue





























A Revised 
Educational Statistics Primer 


Written by students for students 


• Statistics self-taught for the non-mathematical student

• Who desires a more intelligent reading of technical papers in education, and

• … speed and accuracy


This revised edition of statistics explains and illustrates 
calculation procedures for: 


Measures of central position
Measures of variability 

Simple correlation and regression 
Errors of measurement and sampling 
Multiple correlations 


Price $2.00 


DEMBAR PUBLICATIONS, INC. 
Dept. B-303 E. Wilson St. 
MADISON 3, WISCONSIN 





Specifications for Manuscripts
for the 


JOURNAL OF EDUCATIONAL RESEARCH
. . . and the . . .


JOURNAL OF EXPERIMENTAL EDUCATION 


1. All manuscripts must be typewritten, double spaced, and on one side 
of the sheet only. Mimeographed and ditto sheets are acceptable only 
when very clearly printed. 


2. All unusual symbols or formulae must be very clearly typed or hand 
printed in black ink. To avoid costly printers’ composition charges it 
may be necessary for us to make cuts of difficult matter, or to print 


your material by the photo-offset lithography method. The latter means 
photographing your actual copy. It is expensive to have material re- 
drawn by our own artists, and retracing or duplicating increases the 
hazards of error. See that your copy is correct and complete as you 
wish to have it reproduced. The men who work on your manuscripts 
are not trained to understand the working symbols and language of 
your technical field. 


3. The same restrictions and requirements as in Paragraph 2 apply to all drawings, graphs, and other illustrative materials. They must be neatly done, in black ink, on bond paper or tracing cloth suitable for reproduction. Remember that our magazines are printed in black ink only. Color graphs should be changed by the author to provide different kinds of shading for the different areas: for example, diagonal lines for red, vertical lines for blue, and so on. Provide a key.


4. All tables, graphs, etc., on sheets by themselves must be properly labeled 
and identified in relation to the written copy of the manuscript. 


5. Footnotes must be complete as to author, title, place of publication, 
publisher, date and pages. They must be numbered consecutively 
throughout the article. 


6. Bibliographical notes must be complete and arranged alphabetically. 


The cooperation of all prospective authors in following these rules is 
earnestly required. It is difficult to produce technical journals accurately, 
neatly, and on time under the best conditions. Promptness in printing, 
economy, and accuracy will be promoted by carefully prepared manuscripts. 





